Why does this benchmark behave this way?

28 comments, last by swiftcoder 9 years, 2 months ago

This loop, compiled Release x64 with Maximize Speed (/O2):


000000013F2910D0 48 8D 4C 24 60       lea         rcx,[s1]  
000000013F2910D5 BB A0 86 01 00       mov         ebx,186A0h  
000000013F2910DA FF 15 20 0F 00 00    call        qword ptr [__imp_QueryPerformanceCounter (13F292000h)]  
// to run the test 100000 times
		while (i>0) {
				rez=(void **)MPFind(&map1,(void*)0);
000000013F2910E0 48 8D 4C 24 68       lea         rcx,[map1]  
000000013F2910E5 33 D2                xor         edx,edx  
000000013F2910E7 FF 15 13 1F 00 00    call        qword ptr [MPFind (13F293000h)]  
				i--;
000000013F2910ED FF CB                dec         ebx  
000000013F2910EF 48 89 44 24 28       mov         qword ptr [rez],rax  
000000013F2910F4 85 DB                test        ebx,ebx  
000000013F2910F6 7F E8                jg          main+70h (13F2910E0h)  
		}
		QueryPerformanceCounter(&s2);
000000013F2910F8 48 8D 4C 24 20       lea         rcx,[s2]  
000000013F2910FD FF 15 FD 0E 00 00    call        qword ptr [__imp_QueryPerformanceCounter (13F292000h)]  

with a result: 47.070000 46.920000 46.630000 47.230000 46.080000 45.990000 46.500000 46.510000 46.800000 46.720000

And this loop with the MPFind call removed, compiled Release x64 with Maximize Speed (/O2):


		QueryPerformanceCounter(&s1);
000000013F7110C3 48 8D 4C 24 60       lea         rcx,[s1]  
000000013F7110C8 FF 15 32 0F 00 00    call        qword ptr [__imp_QueryPerformanceCounter (13F712000h)]  
// to run the test 100000 times
		while (i>0) {
				rez=&map1;//(void **)MPFind(&map1,(void*)0);
000000013F7110CE 4C 8D 5C 24 68       lea         r11,[map1]  
				i--;
		}
		QueryPerformanceCounter(&s2);
000000013F7110D3 48 8D 4C 24 20       lea         rcx,[s2]  
000000013F7110D8 4C 89 5C 24 28       mov         qword ptr [rez],r11  
000000013F7110DD FF 15 1D 0F 00 00    call        qword ptr [__imp_QueryPerformanceCounter (13F712000h)]

with a result: 0.000000 0.000000 0.000000 0.001000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

Okay, I'll try to give a bit more, since it seems you didn't do the research to answer what I asked above.



What's "wrong" is that performance timings don't work that way, that the time the CPU spends doing work is not directly related to the opcodes, and that this has been the case for over two decades.

Many years ago, pre-1982, you could do timings the way you are attempting. Each opcode took a fixed number of cycles, the data was processed, and the CPU moved on to the next instruction.

Then home computer processors started using data caches and instruction caches and timings became a range. If the data or the instructions were already in the cache it could take a short number of cycles, or if it needed to go fetch the data it could take a large number of cycles.

More levels of cache were added, and timing ranges got worse and worse. The same operation could take anywhere from a single cycle to a large number of cycles depending on the cache.
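To make the cache effect concrete, here is a minimal sketch (the function name is mine, not from the thread): it touches the same bytes in two different orders. Sequential order streams through each cache line once; a large stride forces a fresh cache line on almost every access, so it typically runs far slower even though the instruction count and the final sum are identical.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Sum every byte of buf, visiting them with the given stride.
// stride == 1 is sequential; stride == 64 jumps a full cache line each access.
// Both orders touch every byte exactly once, so the sum is always the same.
uint64_t touch(const std::vector<uint8_t>& buf, size_t stride) {
    uint64_t sum = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < buf.size(); i += stride)
            sum += buf[i];
    return sum;
}
```

Timing the two calls on a buffer much larger than the last-level cache is the classic way to see that "the same work" does not mean "the same time".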

Then Intel introduced a crazy idea that people thought would either fail horribly or win spectacularly. They built something called the "out of order core". The processor could consume several instructions at once, would break that down into sub-tasks and work on many different tasks at once, and then reassemble them and spit the work out all together. The results were amazing. Many instructions could potentially be done in parallel, bringing the cost down to an effective zero cycles. Most instructions could be done somewhat in parallel, so the speed depends on what instructions are around it. A very small number of instructions became more costly, but these were generally expensive already and avoided where possible.
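You can see the out-of-order core at work without any assembly. The sketch below (names are mine) computes the same sum two ways: as one long dependency chain, where each add must wait for the previous result, and as four independent chains the scheduler can run in parallel. On a modern x86 the second version is usually measurably faster despite executing the same number of adds.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <numeric>

// One long dependency chain: every add depends on the previous one,
// so the out-of-order core has nothing to overlap.
uint64_t sum_dependent(const std::vector<uint64_t>& v) {
    uint64_t s = 0;
    for (uint64_t x : v) s += x;
    return s;
}

// Four independent accumulators: the adds in one iteration do not depend
// on each other, so the core can issue them in parallel.
uint64_t sum_four_chains(const std::vector<uint64_t>& v) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t n = v.size() & ~size_t(3);   // round down to a multiple of 4
    for (size_t i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (size_t i = n; i < v.size(); ++i) s0 += v[i];  // leftover tail
    return s0 + s1 + s2 + s3;
}
```

Both return the identical result; only the shape of the dependency graph differs, which is exactly the property the out-of-order core exploits.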

Additional details were added to the OOO core, branch prediction and preprocessing and cache prefetching and much more.

The fact that your timing measurements show less time than the total number of opcodes run in the loop is a good thing. It means the hardware is smarter than you.
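Incidentally, the near-zero timing of the second loop is also partly the compiler: since `rez=&map1` has no observable effect per iteration, the store was hoisted out of the loop entirely, as the disassembly shows. A common way to keep a benchmark loop honest is a volatile sink; here is a minimal sketch (the names are mine, not from the thread):

```cpp
#include <cstdint>

// The pointer variable itself is volatile, so every store to it is an
// observable side effect the compiler must keep inside the loop.
const void* volatile g_sink;

// Assign to the sink on every iteration; the compiler cannot collapse
// this loop to a single store the way it did in the thread's example.
uint64_t loop_with_sink(const void* p, int iterations) {
    uint64_t done = 0;
    for (int i = 0; i < iterations; ++i) {
        g_sink = p;
        ++done;
    }
    return done;
}
```

With the sink in place, the loop measures at least the cost of the stores instead of measuring nothing at all.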

Even though you have some horrible things in your code, like turning your 64-bit values into floating point numbers and then converting them back, or telling the compiler to not inline a function that absolutely should be inlined to avoid the completely unnecessary cost of a function call, the processor is still awesome enough to overcome your piss-poor code and give better performance than you have any right to expect.

What's "wrong" is that performance timings don't work that way, that the time the CPU spends doing work is not directly related to the opcodes, and that this has been the case for over two decades.

I don't fully understand what you've written.

This is not about how to write the code in C, how many cycles it takes, or how to get the most effective opcodes while keeping full functionality.

And it's not about whether the compiler can optimize the code effectively, but about the fact that people confuse these effects with the real situation.

The results were amazing. Many instructions could potentially be done in parallel, bringing the cost down to an effective zero cycles. Most instructions could be done somewhat in parallel, so the speed depends on what instructions are around it.

Statements like that suggest you have no idea what exactly goes on inside the processor during execution on a single core. A Core i-series processor has no way to parallelize linear code where the result of each instruction depends on the previous instruction. And it certainly cannot split such code between processor cores. It can't see the future without doing the calculations. Or can the new processors guess the cards? :)

Even though you have some horrible things in your code, like turning your 64-bit values into floating point numbers and then converting them back, or telling the compiler to not inline a function that absolutely should be inlined to avoid the completely unnecessary cost of a function call, the processor is still awesome enough to overcome your piss-poor code and give better performance than you have any right to expect.

All of this relates only very indirectly to the question I asked, and looks like an attempt to justify a wrong answer to the wrong question.

Yes, we can use inline, and the compiler is in fact able to fully unroll the method; the result can be 40 cycles. Or it can boil everything down to one simple constant calculation, and we get 1 cycle or 0.

But to assert that the correct result is the 150 cycles, given the large number of memory accesses, while calling the 48-cycle result incorrect even though it shows us the difference when the same algorithm runs in registers - that is misleading. It is misleading to say that the 48-cycle result in this particular example comes from the compiler optimizing it down to a constant calculation. Nor can you point to the sample with the loop body removed, which comes out at 4800000 cycles.

I can give another example where the search is performed not on one value but on a series of values, to rule out the possibility of prediction and to impede the search,
and the result is still less than 150.



#ifdef _MSC_VER
#include <windows.h>
#include <intrin.h>  // __rdtsc
#endif
#ifdef __MINGW32__
#include <windows.h>  // QueryPerformanceCounter, LARGE_INTEGER
#endif
#include <stdio.h>
#define max_count_calc 100000

#define _voidint long long

#if defined(_M_IX86) && !defined(CPU_X86_64)
#define CPU_X86_32
#endif

#ifdef _MSC_VER
#ifndef _NOINLINE
#define _NOINLINE __declspec(noinline)
#endif
#endif

#ifdef __GNUC__
#ifndef __MINGW32__
#ifndef _NOINLINE
#define _NOINLINE __attribute__((noinline))
#endif
#endif
#endif

// fall back to no annotation on any other compiler
#ifndef _NOINLINE
#define _NOINLINE
#endif

struct Element {
    void *key;
    void *val;
};

typedef void ** pointer;

// test example binary search
pointer _NOINLINE findVoidSortMap(void ** list,void *key)
{
    if (!list) return 0;
    if (!*list) return 0;
    unsigned int count= **(unsigned int**)list;
    char *p=(char*)*list;
    p+=4;
    Element *b=(Element *)p;

    _voidint skey=(_voidint)key;

	while (count>0) {
		void** kt=(void**)&b[count>>1];
		_voidint rkey=(_voidint)kt[0];
		if (skey==rkey) return (void**)&kt[1];
		if (skey>rkey) {b+=(count>>1)+1;count--;}
		count=count>>1;
	}

	return 0;
}

#ifdef _MSC_VER
#ifdef CPU_X86_32
long long _NOINLINE _TSCDCPU() 
{ 
  _asm rdtsc;  // rdtsc leaves the counter in edx:eax, the x86 return convention for long long
}
#else
long long _NOINLINE _TSCDCPU() 
{ 
  return __rdtsc(); 
}
#endif
#endif

#ifdef __GNUC__
#ifndef __MINGW32__
long long _NOINLINE _TSCDCPU() 
{ 
  // rdtsc splits the counter across edx:eax; recombine and return it explicitly
  unsigned int lo, hi;
  asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((long long)hi << 32) | lo;
}
#endif
#endif

typedef void* (*tfSTDCALL_p_FUNC_p_p)(void*,void*);

tfSTDCALL_p_FUNC_p_p MPFind=(tfSTDCALL_p_FUNC_p_p)&findVoidSortMap;

int main(int argc, char *argv[])
{
// allok array
    void *map1=(void *)new char[sizeof(int)+sizeof(void*)*2*1024];
	char *p=(char *)map1;
// to set the number of elements equal to 1024
	int *c=(int *)p;
	*c=1024;
// fill in the elements of array key:values 0:0 10:1 20:2 30:3..
	p+=sizeof(int);
	for (int i=0;i<1024;i++) {
		void**key=(void**)p;
		*key=(void*)(i*10);
		p+=sizeof(void*);
		void**val=(void**)p;
		*val=(void*)i;
		p+=sizeof(void*);
	}
// values for finding
	void **rezvalues=(void **)new char[sizeof(void*)*1024];
	void **findvalues=(void **)new char[sizeof(void*)*1024];
	for (int i=0;i<1024;i++) {
		findvalues[i]=(void*)(i*10);
	}

	void **rez;

	for (int j=0;j<10;j++) {
		int i=max_count_calc;
		LARGE_INTEGER s1;
		LARGE_INTEGER s2;
		void** rezval=rezvalues;
		void** val=findvalues;
		void** max=findvalues;
		max+=1024;
		QueryPerformanceCounter(&s1);
// to run the test 100000 times
		while (i>0) {
			*rezval=(void **)MPFind(&map1,(void*)*val);
			i--;
			val++;
			rezval++;
			if (val>=max) {
				val=findvalues;
				rezval=rezvalues;
			}
		}
		QueryPerformanceCounter(&s2);
// to calculate the average result for a single test
		s1.QuadPart=(s2.QuadPart-s1.QuadPart);
		double s3=s1.QuadPart*(1000.0/max_count_calc);
		printf("%lf \n",s3);
	}

	printf( "%i", **(int **)rezvalues+1 );

	int pause;
	scanf( "%d", &pause );  // was &rez (a void**), which is the wrong type for %d

	delete[] (char*)rezvalues;
	delete[] (char*)findvalues;
	delete[] (char*)map1;

	return 0;
}

#ifdef _MSC_VER
int __stdcall WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, INT)
{
	return main(0,0);
}

#endif

120.220000 116.510000 113.500000 115.470000 115.960000 119.300000 119.410000 119.130000 119.450000 119.420000

But you need to take into account the fact that this test now references a data array,
and if I run the same test with optimization Disabled (/Od) the result is:
249.760000 253.250000 249.150000 250.800000 247.830000 254.040000 252.100000 248.670000 249.330000 239.590000
Is that an objective result?
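For reference, the search in findVoidSortMap can be sketched portably over a plain struct array, without the packed-byte layout and casts (the names here are mine, not from the thread). The step logic is the same: halve the range each iteration, stepping past the midpoint when the key is larger.

```cpp
#include <cstdint>

struct Pair {
    intptr_t key;
    intptr_t val;
};

// Binary search with the same shape as findVoidSortMap:
// b points at the current sub-range, count is its length.
Pair* find_pair(Pair* b, unsigned count, intptr_t skey) {
    while (count > 0) {
        Pair* mid = &b[count >> 1];
        if (skey == mid->key) return mid;
        if (skey > mid->key) { b += (count >> 1) + 1; count--; }
        count >>= 1;
    }
    return nullptr;  // key not present
}
```

Written this way it is easy to unit-test the search logic separately from any timing harness.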


Statements like that suggest you have no idea what exactly goes on inside the processor during execution on a single core. A Core i-series processor has no way to parallelize linear code where the result of each instruction depends on the previous instruction. And it certainly cannot split such code between processor cores. It can't see the future without doing the calculations. Or can the new processors guess the cards?

No magic needed, just a bit of data flow analysis. That's what out-of-order execution is all about.

Modern CPUs *do* in fact guess the future. This is called speculative execution.

Counting cycles is useless nowadays. If you want to benchmark something, write a sufficiently complex program and measure actual time, not cycles.
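In portable C++ that means something like the sketch below (the helper name is mine): a monotonic clock wrapped around the work, which is the role QueryPerformanceCounter plays in the thread's Windows code.

```cpp
#include <chrono>
#include <cstdint>

// Measure the wall-clock time of a callable, in nanoseconds.
// steady_clock is monotonic, which is what you want for intervals.
template <typename F>
int64_t time_ns(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

As with the thread's benchmark, the measured callable still needs an observable side effect, otherwise the optimizer may delete the work being timed.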

As to the rest of your post... I have absolutely no idea what you're saying or asking.

current project: Roa

These facts are not relevant to the question, and only add to the confusion. The question is:
why do people believe the Disabled (/Od) result (in cycles or time) has priority as the correct one, like this:


// test example binary search
pointer _NOINLINE findVoidSortMap(void ** list,void *key)
{
000000013F1B1000 48 89 54 24 10       mov         qword ptr [rsp+10h],rdx  
000000013F1B1005 48 89 4C 24 08       mov         qword ptr [rsp+8],rcx  
000000013F1B100A 48 83 EC 38          sub         rsp,38h  
    if (!list) return 0;
000000013F1B100E 48 83 7C 24 40 00    cmp         qword ptr [list],0  
000000013F1B1014 75 07                jne         findVoidSortMap+1Dh (13F1B101Dh)  
000000013F1B1016 33 C0                xor         eax,eax  
000000013F1B1018 E9 DB 00 00 00       jmp         findVoidSortMap+0F8h (13F1B10F8h)  
    if (!*list) return 0;
000000013F1B101D 48 8B 44 24 40       mov         rax,qword ptr [list]  
000000013F1B1022 48 83 38 00          cmp         qword ptr [rax],0  
000000013F1B1026 75 07                jne         findVoidSortMap+2Fh (13F1B102Fh)  
000000013F1B1028 33 C0                xor         eax,eax  
000000013F1B102A E9 C9 00 00 00       jmp         findVoidSortMap+0F8h (13F1B10F8h)  
    unsigned int count= **(unsigned int**)list;
000000013F1B102F 48 8B 44 24 40       mov         rax,qword ptr [list]  
000000013F1B1034 48 8B 00             mov         rax,qword ptr [rax]  
000000013F1B1037 8B 00                mov         eax,dword ptr [rax]  
000000013F1B1039 89 44 24 10          mov         dword ptr [count],eax  
    char *p=(char*)*list;
000000013F1B103D 48 8B 44 24 40       mov         rax,qword ptr [list]  
000000013F1B1042 48 8B 00             mov         rax,qword ptr [rax]  
000000013F1B1045 48 89 04 24          mov         qword ptr [rsp],rax  
    p+=4;
000000013F1B1049 48 8B 04 24          mov         rax,qword ptr [rsp]  
000000013F1B104D 48 83 C0 04          add         rax,4  
000000013F1B1051 48 89 04 24          mov         qword ptr [rsp],rax  
    Element *b=(Element *)p;
000000013F1B1055 48 8B 04 24          mov         rax,qword ptr [rsp]  
000000013F1B1059 48 89 44 24 18       mov         qword ptr [b],rax  

    _voidint skey=(_voidint)key;
000000013F1B105E 48 8B 44 24 48       mov         rax,qword ptr [key]  
000000013F1B1063 48 89 44 24 08       mov         qword ptr [skey],rax  

	while (count>0) {
000000013F1B1068 83 7C 24 10 00       cmp         dword ptr [count],0  
000000013F1B106D 0F 86 83 00 00 00    jbe         findVoidSortMap+0F6h (13F1B10F6h)  
		void** kt=(void**)&b[count>>1];
000000013F1B1073 8B 44 24 10          mov         eax,dword ptr [count]  
000000013F1B1077 D1 E8                shr         eax,1  
000000013F1B1079 8B C0                mov         eax,eax  
000000013F1B107B 48 6B C0 10          imul        rax,rax,10h  
000000013F1B107F 48 8B 4C 24 18       mov         rcx,qword ptr [b]  
000000013F1B1084 48 03 C8             add         rcx,rax  
000000013F1B1087 48 8B C1             mov         rax,rcx  
000000013F1B108A 48 89 44 24 28       mov         qword ptr [kt],rax  
		_voidint rkey=(_voidint)kt[0];
000000013F1B108F 48 8B 44 24 28       mov         rax,qword ptr [kt]  
000000013F1B1094 48 8B 00             mov         rax,qword ptr [rax]  
000000013F1B1097 48 89 44 24 20       mov         qword ptr [rkey],rax  
		if (skey==rkey) return (void**)&kt[1];
000000013F1B109C 48 8B 44 24 20       mov         rax,qword ptr [rkey]  
000000013F1B10A1 48 39 44 24 08       cmp         qword ptr [skey],rax  
000000013F1B10A6 75 0B                jne         findVoidSortMap+0B3h (13F1B10B3h)  
000000013F1B10A8 48 8B 44 24 28       mov         rax,qword ptr [kt]  
000000013F1B10AD 48 83 C0 08          add         rax,8  
000000013F1B10B1 EB 45                jmp         findVoidSortMap+0F8h (13F1B10F8h)  
		if (skey>rkey) {b+=(count>>1)+1;count--;}
000000013F1B10B3 48 8B 44 24 20       mov         rax,qword ptr [rkey]  
000000013F1B10B8 48 39 44 24 08       cmp         qword ptr [skey],rax  
000000013F1B10BD 7E 28                jle         findVoidSortMap+0E7h (13F1B10E7h)  
000000013F1B10BF 8B 44 24 10          mov         eax,dword ptr [count]  
000000013F1B10C3 D1 E8                shr         eax,1  
000000013F1B10C5 FF C0                inc         eax  
000000013F1B10C7 8B C0                mov         eax,eax  
000000013F1B10C9 48 6B C0 10          imul        rax,rax,10h  
000000013F1B10CD 48 8B 4C 24 18       mov         rcx,qword ptr [b]  
000000013F1B10D2 48 03 C8             add         rcx,rax  
000000013F1B10D5 48 8B C1             mov         rax,rcx  
000000013F1B10D8 48 89 44 24 18       mov         qword ptr [b],rax  
000000013F1B10DD 8B 44 24 10          mov         eax,dword ptr [count]  
000000013F1B10E1 FF C8                dec         eax  
000000013F1B10E3 89 44 24 10          mov         dword ptr [count],eax  
		count=count>>1;
000000013F1B10E7 8B 44 24 10          mov         eax,dword ptr [count]  
000000013F1B10EB D1 E8                shr         eax,1  
000000013F1B10ED 89 44 24 10          mov         dword ptr [count],eax  
	}
000000013F1B10F1 E9 72 FF FF FF       jmp         findVoidSortMap+68h (13F1B1068h)  

	return 0;
000000013F1B10F6 33 C0                xor         eax,eax  
}
000000013F1B10F8 48 83 C4 38          add         rsp,38h  
000000013F1B10FC C3                   ret  

Which gives 150 in the first variant and 250 in the second.

While explaining away the behavior of the same algorithm under Maximize Speed (/O2):


// test example binary search
pointer _NOINLINE findVoidSortMap(void ** list,void *key)
{
000000013F6C1000 4C 8B CA             mov         r9,rdx  
    if (!list) return 0;
000000013F6C1003 48 85 C9             test        rcx,rcx  
000000013F6C1006 74 41                je          findVoidSortMap+49h (13F6C1049h)  
    if (!*list) return 0;
000000013F6C1008 48 8B 11             mov         rdx,qword ptr [rcx]  
000000013F6C100B 48 85 D2             test        rdx,rdx  
000000013F6C100E 74 39                je          findVoidSortMap+49h (13F6C1049h)  
    unsigned int count= **(unsigned int**)list;
000000013F6C1010 8B 0A                mov         ecx,dword ptr [rdx]  
    char *p=(char*)*list;
    p+=4;
000000013F6C1012 48 83 C2 04          add         rdx,4  
    Element *b=(Element *)p;

    _voidint skey=(_voidint)key;

	while (count>0) {
000000013F6C1016 85 C9                test        ecx,ecx  
000000013F6C1018 74 2F                je          findVoidSortMap+49h (13F6C1049h)  
000000013F6C101A 66 0F 1F 44 00 00    nop         word ptr [rax+rax]  
		void** kt=(void**)&b[count>>1];
000000013F6C1020 8B C1                mov         eax,ecx  
000000013F6C1022 48 D1 E8             shr         rax,1  
000000013F6C1025 48 C1 E0 04          shl         rax,4  
000000013F6C1029 48 03 C2             add         rax,rdx  
		_voidint rkey=(_voidint)kt[0];
000000013F6C102C 4C 8B 00             mov         r8,qword ptr [rax]  
		if (skey==rkey) return (void**)&kt[1];
000000013F6C102F 4D 3B C8             cmp         r9,r8  
000000013F6C1032 74 18                je          findVoidSortMap+4Ch (13F6C104Ch)  
		if (skey>rkey) {b+=(count>>1)+1;count--;}
000000013F6C1034 7E 0F                jle         findVoidSortMap+45h (13F6C1045h)  
000000013F6C1036 8B C1                mov         eax,ecx  
000000013F6C1038 D1 E8                shr         eax,1  
000000013F6C103A FF C0                inc         eax  
000000013F6C103C 48 C1 E0 04          shl         rax,4  
000000013F6C1040 48 03 D0             add         rdx,rax  
000000013F6C1043 FF C9                dec         ecx  
		count=count>>1;
000000013F6C1045 D1 E9                shr         ecx,1  
    Element *b=(Element *)p;

    _voidint skey=(_voidint)key;

	while (count>0) {
000000013F6C1047 75 D7                jne         findVoidSortMap+20h (13F6C1020h)  
	}

	return 0;
000000013F6C1049 33 C0                xor         eax,eax  
}
000000013F6C104B C3                   ret  
		if (skey==rkey) return (void**)&kt[1];
000000013F6C104C 48 83 C0 08          add         rax,8  
}
000000013F6C1050 C3                   ret  

Which gives 48 and 113, as if the compiler's behavior here gives no idea of the work done by the full code.

If I were counting cycles as absolute cycles, rather than as factors for comparing performance, and the question were about why the same opcodes give different performance, then all your facts would be appropriate. But we are talking about completely different algorithms.

You understand that you are simply refusing to hear me.

Or do you think the findVoidSortMap algorithm under Maximize Speed (/O2) could execute in 0 cycles?

Do you really think that something like this can be done in 1 cycle?


000000013F6C1036 8B C1                mov         eax,ecx  
000000013F6C1038 D1 E8                shr         eax,1  
000000013F6C103A FF C0                inc         eax  
000000013F6C103C 48 C1 E0 04          shl         rax,4  
000000013F6C1040 48 03 D0             add         rdx,rax 
I'm not sure anyone completely understands what you are asking.

Are you asking why the Intel processors give different results to the Elbrus-2C+(E2K)? If so that'll be down to the internal design of the Intel chips vs the Elbrus. (Remember: the assembly you see is NOT what the CPU executes in the ALU units - while the assembly might look like a CISC under the hood Intel CPUs have been converting to RISC for some time now.)

If you are asking why /Od gives different results to /O2, well look at the assembly output - the /O2 is clearly doing less work instruction wise and the processor is doing the rest.

Simply put the superscalar out-of-order architecture of the Intel processor is better than the Elbrus's architecture which is why it can do the same work in less time.

If you are asking something else then you need to be clearer in your intent.

You understand that only combinations like these can be run in parallel:


000000013F6C1040 48 03 D0             add         rdx,rax  
000000013F6C1043 FF C9                dec         ecx  

000000013F6C1010 8B 0A                mov         ecx,dword ptr [rdx]  
000000013F6C1012 48 83 C2 04          add         rdx,4 

In the second case, the reads of the virtual register rdx by the first and second instructions can be executed in parallel, and the write to the virtual rdx may land in a different physical register. That is where out-of-order execution ends, and then we are back in the 80s. The most we can get out of this theory is that the micro-operations of a single instruction execute in 1 cycle each instead of 4-5, thanks to the additional pipelines.

This topic is closed to new replies.
