
# 'rdtsc' opcode and QueryPerformanceCounter()


## Recommended Posts

Hello. I'm trying to figure out what's up with all these timing functions that are provided. There are three main ones: timeGetTime(), GetTickCount(), and QueryPerformanceInterface(). Now, I read over here that QueryPerformanceInterface() had the best execution speed and gave more precise timing info. However, I used this piece of asm from Intel to test things out:

```cpp
union ticksInt
{
    __int32 i32[2];
    __int64 i64;
};

__int64 getTicks()
{
    ticksInt a;
    __asm
    {
        rdtsc
        mov a.i32[0], eax   // low dword (MSVC inline asm subscripts are byte offsets)
        mov a.i32[4], edx   // high dword (byte offset 4)
    }
    return a.i64;
}
```

And this is the program I used to test all the timing functions...


```cpp
#include <windows.h>
#include <iostream>   // was missing in the original post
using namespace std;
// link with winmm.lib for timeGetTime; getTicks() as defined above

int main()
{
    __int64 start, end, dif;
    int testLoops = 10000;

    start = getTicks();
    cout << "getTicks " << (dif = getTicks() - start) << endl;

    start = getTicks();
    for( int i = 0; i < testLoops; i++ )
    {
        GetTickCount();
    }
    end = getTicks();
    cout << "GetTickCount " << (end - start - dif*2)/testLoops << endl;

    start = getTicks();
    for( int i = 0; i < testLoops; i++ )
    {
        timeGetTime();
    }
    end = getTicks();
    cout << "timeGetTime " << (end - start - dif*2)/testLoops << endl;

    LARGE_INTEGER li;
    start = getTicks();
    for( int i = 0; i < testLoops; i++ )
    {
        QueryPerformanceCounter(&li);
    }
    end = getTicks();
    cout << "QueryPerformanceCounter " << (end - start - dif*2)/testLoops << endl;

    return 0;
}
```

It turns out that QueryPerformanceInterface is the slowest of the bunch; am I doing something wrong? Also, about that piece of asm code:

1. Will it work on all x86 CPUs? I guess it will only work on Pentium or higher, since it needs a 64-bit value? How about AMD and Cyrix?
2. Does the asm take just 2 clock cycles to execute? I'm not sure how to test how long the getTicks() function takes to execute.
3. The value returned by getTicks() is different than the value returned by QueryPerformanceInterface. Shouldn't they be the same, since they both get their data from the high performance timer?
4. How do I find out how many ticks there are in a second when using the getTicks() function?
5. What's the compatibility status of the getTicks() function? i.e., is it safe to assume that this function would work on all/most systems? If not, then how do I go about this?

I know those are a lot of questions, but I've just started learning asm, and I always used to think that QueryPerformanceInterface was the best way to go, but now I'm getting doubts. Thanks for any input.
::: Al:::
[Triple Buffer V2.0] - Resource leak imminent. Memory status: Fragile [edited by - IFooBar on January 23, 2003 6:17:32 PM]

##### Share on other sites
QueryPerformanceCounter could be implemented with rdtsc, but most probably isn't, because there are some problems with that.

First of all, rdtsc reads a counter that is incremented on every processor tick. That would be all good if the CPU speed were stable, but unfortunately it isn't.

Some mobile computers change the CPU speed on the fly. Some motherboards allow you to change the CPU speed inside Windows, so that's a possibility to cheat.

I have heard that future motherboards will have a high performance timer built in, so that's good news for game programmers and gamers.

##### Share on other sites
There is no such thing as QueryPerformanceInterface.

##### Share on other sites
> It turns out that QueryPerformanceInterface is the slowest of the bunch, am I doing something wrong?
QueryPerformanceCounter uses either the PIT (1.193 MHz), or the PCI timer (3.580 MHz); both methods take several µs.

quote:
2: Does the asm take just 2 clock cycles to execute? I'm not sure how to test how long the getTicks() function takes to execute.

~50 clocks on an Athlon (you need a serializing instruction such as cpuid to make sure the rdtsc instruction is executed when it should).

quote:
3: The value returned by getTicks() is different than the value returned by QueryPerformanceInterface. Shouldn't they be the same, since they both get their data from the high performance timer?

Nope. You can get the perf. cnt. freq. with QueryPerformanceFrequency; the TSC is a clock counter.

quote:
4: How do I find out how many ticks there are in a second when using the getTicks() function?

```cpp
t1 = getTicks();
/* wait e.g. 55 ms */
t2 = getTicks();
ticks_per_sec = (t2 - t1) * (1000. / delay_ms);
```

quote:
1: Will it work on all x86 CPUs? I guess it will only work on Pentium or higher, since it needs a 64-bit value? How about AMD and Cyrix?

quote:
5: What's the compatibility status of the getTicks() function? i.e., is it safe to assume that this function would work on all/most systems? If not, then how do I go about this?

```
/* make sure CPUID is supported - either SEH, or toggle eflags.ID */
eax <- 1
cpuid
tsc_supported = edx[4]
```

Then again, it's a pretty safe assumption to make: it's been there 10 years now.

> Some mobile computers change the cpu speed on the fly. Some motherboards allow you to change the cpu speed inside windows, so that's a possibility to cheat. <
On top of that, the CPU clock jitters a bit, and you have problems on multiprocessor systems (CPU clocks may differ).

> I have heard that future motherboards will have a high performance timer built in, so that's good news for game programmers and gamers <
There is already one with a resolution of ~838 ns.

The lowdown:
Fastest timing method: read the 10 ms counter at 0x7ffe0000
Most accurate: rdtsc
Easiest to use: GetTickCount

```
if(need high res)
  if(single x86 desktop cpu)
    rdtsc
  else if(win only)
    QPC
  else if(pmode)
    gettimeofday
  else
    PIT
else if(med. res)
  GetTickCount / glutGet(GLUT_ELAPSED_TIME) / SDL_GetTicks
```

HTH
Jan

/* edit: fixed code tag */


[edited by - Jan Wassenberg on January 23, 2003 10:29:14 PM]

##### Share on other sites
I would expect the same source clock to drive all the oscillators on the motherboard. Whatever the dynamic range of the PCI bus is, the CPU clock ought to be identical.

Cheap VCXOs are good to 25 ppm, though I'm not sure what the stability of the CPU clock is. I would expect it to be 25 ppm or less.

I thought QueryPerformanceCounter was extremely slow; I got 5 µs in my timings as well.

You can consider just using rdtsc for your timings. I think it works on all Pentium-class machines and greater (though it may be Pentium II or greater). Very few gamers are using Pentiums or K6-2s today, and fewer still are using 486s (dare I say 0).

If you want actual seconds from rdtsc, then you time the CPU using something like GetTickCount, and spin like mad waiting for it to change. Once it changes, snap the rdtsc, then sleep for a little while, say 100 ms. Then spin like mad waiting for GetTickCount to change again, and snap rdtsc. This way you know your timings started right at the millisecond switch-over, and not somewhere in between. Now you have an rdtsc difference for a known number of milliseconds (calculate it from the values returned by GetTickCount), and voilà, an accurate timer.

##### Share on other sites
The test is flawed; it rests on an invalid supposition, namely that you would select the method of measuring time based upon the performance of the call. Two of those methods have such low resolution that to get to ±1% accuracy in the measurement you could only call them ten times a second. If a timer has a resolution of 0.001 s, then your accuracy is only ±100% when measuring a 0.001-second interval. You could have taken a reading at 0.0001 seconds and again at 0.0019, or at 0.0009 and again at 0.0011 seconds. One is a 0.0018-second interval and the other is 0.0002, but both register as 0.001. If you read the value until it changes to get more or less in sync, then keep reading until it changes again and measure that interval, you will most likely find it varies: it may change after 0.0015 seconds one time and 0.0005 the next. The timer is extremely accurate in that it may only be off by 1 second in a year, which is 1 part in 31.5 million seconds, but it may not be nearly as precise on an individual reading.

The point of that is that if you are calling it no more frequently than 10 times a second, then how long the call takes makes virtually no difference. What, perhaps a microsecond? So what would that be, 0.01% of your overall processing time? Who cares? That is one fundamental problem with the test: it provides you with trivia of no practical value. You do not know any better how to select the right timer for the task you are attempting after the test than you did before.

One fundamental mistake you made is in what two back-to-back calls of a timer tell you: they tell you how long ONE call takes. The time is read in the middle of the routine; that read marks both the end of one interval and the start of the next. The start of the interval is not the time at which the caller started setting up for the call, nor is the end the time at which the caller stores the return value. So if you want a routine to measure itself, what you should measure is how long it takes to make two back-to-back calls.

A more valuable test with QueryPerformanceCounter and your getTicks would be whether they can measure one another and themselves; rdtsc certainly can. Do back-to-back calls to QueryPerformanceCounter always return different values? If so, then for both of them, how much does the difference vary? If not, then how many times can you call QueryPerformanceCounter before it changes? Just use a sequence of query, query, rdtsc, rdtsc, query, rdtsc. Record the query-to-query intervals and the rdtsc-to-rdtsc intervals. Then repeat the test and record the results for each run.

I think that will do a lot more toward providing you knowledge that is actually useful. Once you have a baseline with QueryPerformanceCounter and rdtsc, you can use them to measure the other two. You don't need a loop. It doesn't take longer to read the value after it changed than to read the same value twice. The loop is overhead, just like the calls to the timers. With 10k iterations the loop is greater overhead than the calls to the timers. Overall, I would say make fewer assumptions and take more measurements. You seem to have assumed the loop is trivial and the call to the timers isn't. I could be wrong, but my guess is that if you actually measure them, you will find it is the other way around.

##### Share on other sites
quote:
Do back-to-back calls to QueryPerformanceCounter always return different values? If so, then for both of them, how much does the difference vary?

Back-to-back calls of QueryPerformanceCounter return different values, with a difference of about ~5.

I tried doing something like

```cpp
start = getTicks();
// Wait for a second
end = getTicks();

tickspersec = end - start;
```

but tickspersec varies a lot at different times. So either I'm not doing the second count right, or rdtsc does have its problems. I'm using GetTickCount() to check for a second; since GetTickCount() returns milliseconds, I guess that's OK, right?

Okay, and I got rid of all the for loops, and QueryPerformanceCounter still takes the longest (in clock ticks) to execute (about 3000 clock ticks in debug mode), while GetTickCount() takes ~55 clock ticks and timeGetTime() takes ~2000. So AFAIK GetTickCount() has the least overhead, but is also the least accurate (~1 ms accuracy). timeGetTime() is supposed to have a solid accuracy of 1 ms, and QueryPerformanceCounter() depends on the system, so you'd have to use QueryPerformanceFrequency() to find that out.

quote:

The lowdown:
Fastest timing method: read the 10 ms counter at 0x7ffe0000
Most accurate: rdtsc
easiest to use: GetTickCount

OK, now this is confusing me. I thought "the CPU clock jitters a bit", so that would make rdtsc not accurate. And what is the "10 ms counter"? Do you mean a 10-millisecond counter of some sort? If so, then 10 ms is too big a timing gap to use to measure function calls.

quote:
If you want actual seconds from rdtsc, then you time the CPU using something like getTick, and spin like mad waiting for it to change. Once it changes... , , , ...a number of milliseconds (calculate it from the values returned by getTick), and voila, an accurate timer.

Do you mean I have to enter a loop and wait for getTicks() to report a 0 count, and then start timing for a few milliseconds? What do you mean by "snap rdtsc"? Alf is confused, people!

thanks for the replies, a few things are still uncertain though...

:::Al:::
[Triple Buffer V2.0] - Resource leak imminent. Memory status: Fragile

##### Share on other sites
Magmai Kai Holmlor:

> I would expect the same source clock to drive all the oscillators on the mother board
They did it that way on the original PC; I don't think that's the case anymore.

rdtsc is available on Pentiums and K6-2 as well.

IFooBar:

> but tickspersec varies a lot at different times.
Task switching and improperly measuring the delay time will mess up your clock count.

quote:
OK, now this is confusing me. I thought "the CPU clock jitters a bit", so that would make rdtsc not accurate. And what is the "10 ms counter"?

The length of a clock may drift, but that'll be blown away by the high resolution, as long as you're timing stuff that takes >> 1 clock.
Windows maps the tick count into your app's mem at 0x7ffe0000 (note: in units of 10 ms). Have a look at the GetTickCount code.

quote:
Do you mean I have to enter a loop and wait for getTicks() to report a 0 count, and then start timing for a few milliseconds? What do you mean by "snap rdtsc"? Alf is confused, people!

He means you should wait until the start of a (windows system clock) tick when getting your time reference; your call to GetTickCount might come just after / before the next tick, throwing you off by 10 ms worst case.

/* edit: typo */

[edited by - Jan Wassenberg on January 24, 2003 11:55:49 AM]

##### Share on other sites
Check out this message from the LKML. Later messages in the thread elaborate on the rdtsc comment.

Regards,
Drew Vogel

##### Share on other sites

```cpp
#include <conio.h>
#include <windows.h>
#include <iostream>   // was missing in the original post
using namespace std;
// link with winmm.lib for timeGetTime

#pragma warning(push)
#pragma warning(disable: 4035) // just kidding about the returning value
__forceinline __int64 cpu_ticks()
{
    _asm
    {
        cpuid;   // serialize so rdtsc executes in order
        rdtsc;   // result left in edx:eax
    }
}
#pragma warning(pop)

double CPU_Speed_Hz()
{
    DWORD dwInit_ms, dwStart_ms, dwEnd_ms;
    volatile __int64 i64Start, i64Stop;
    volatile __int64 i64Overhead;

    i64Overhead = cpu_ticks() - cpu_ticks();
    i64Overhead = cpu_ticks() - cpu_ticks();
    i64Overhead = cpu_ticks() - cpu_ticks();

    dwInit_ms = dwStart_ms = timeGetTime();
    // Ensures we start on a fresh ms
    while(dwInit_ms == dwStart_ms)
    {
        dwStart_ms = timeGetTime();
        i64Start   = cpu_ticks();
    }

    while((timeGetTime() - dwStart_ms) < 1000)
    {
        Sleep(10);
    }

    dwInit_ms = dwEnd_ms = timeGetTime();
    // Ensures we end on a fresh ms
    while(dwInit_ms == dwEnd_ms)
    {
        dwEnd_ms = timeGetTime();
        i64Stop  = cpu_ticks();
    }

    __int64 i64Ticks   = i64Stop - i64Start + i64Overhead;
    double dElapsed_ms = dwEnd_ms - dwStart_ms;
    double dSpeed_Hz   = static_cast<double>(i64Ticks / dElapsed_ms / 1000.00);
    return dSpeed_Hz;
}

int main(int argc, char* argv[])
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    double Speed;
    double Sum = 0;
    const int iterations = 10;
    for(int i = 0; i < iterations; i++)
    {
        Speed = CPU_Speed_Hz();
        Sum += Speed;
        cout << "CPU Speed: " << Speed << " MHz" << endl;
    }
    cout << "Average: " << Sum/iterations << endl;
    getch();
    return 0;
}
```

##### Share on other sites
Ooh, now I get it. Thanks for showing that code. I actually just realized the connection of the CPU speed with this (doh!).

Thanks for clearing everything up guys.

:::Al:::
[Triple Buffer V2.0] - Resource leak imminent. Memory status: Fragile

##### Share on other sites
Hi again. Magmai: I took your code and did a little reading at AMD, and now this buffed-up version does it pretty accurately. I thought I'd post it here, since you guys were kind enough to help out.

```cpp
#include <conio.h>
#include <windows.h>
#include <iostream>
#pragma comment( lib, "winmm.lib" )

unsigned int CPUSpeed();
inline __int64 GetTicks();

union ticksInt
{
    __int32 i32[2];
    __int64 i64;
};

using namespace std;

int main()
{
    // Save the current settings so we can restore them afterward
    // (the original post queried the thread handle for the priority
    // class and swapped the restore calls; fixed here).
    const DWORD lastPriority     = GetPriorityClass(GetCurrentProcess());
    const int lastThreadPriority = GetThreadPriority(GetCurrentThread());

    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    unsigned int Speed;
    double Sum = 0;
    const int iterations = 10;
    for(int i = 0; i < iterations; i++)
    {
        Speed = CPUSpeed();
        Sum += Speed;
        cout << "CPU Speed: " << Speed << " Hz\n";
    }

    SetPriorityClass(GetCurrentProcess(), lastPriority);
    SetThreadPriority(GetCurrentThread(), lastThreadPriority);

    cout << "\nAverage: " << (Sum/iterations)/1000000.0 << " MHz\n";
    getch();
    return 0;
}

__int64 GetTicks()
{
    ticksInt a;
    __asm
    {
        rdtsc
        mov a.i32[0], eax
        mov a.i32[4], edx
    }
    return a.i64;
}

unsigned int CPUSpeed()
{
    int timeStart, timeStop;
    __int64 startTick, endTick, overhead;

    overhead = GetTicks() - GetTicks();
    overhead = GetTicks() - GetTicks();
    overhead = GetTicks() - GetTicks();

    // Sync to a fresh millisecond before snapping the start tick
    timeStart = timeGetTime();
    while( timeGetTime() == timeStart ) timeStart = timeGetTime();

    while(1)
    {
        timeStop = timeGetTime();
        if( (timeStop - timeStart) > 1 )
        {
            startTick = GetTicks();
            break;
        }
    }

    timeStart = timeStop;
    while(1)
    {
        timeStop = timeGetTime();
        if( (timeStop - timeStart) > 1000 )
        {
            endTick = GetTicks();
            break;
        }
    }

    return (unsigned int)((endTick - startTick) + overhead);
}
```

I *think* that it gets the speed very accurately now. Would putting in the #pragma push/pop make much of a difference in the above code?

:::Al:::
[Triple Buffer V2.0] - Resource leak imminent. Memory status: Fragile