Jump to content
  • Advertisement

foxes

Member
  • Content Count

    22
  • Joined

  • Last visited

Community Reputation

98 Neutral

About foxes

  • Rank
    Member

Personal Information

  1. I mean what is in your mind, there is clearly something wrong, and that it going on I do not know   I'm not talking about a single instruction and about the whole algorithm, and it is quite possible to count the number of cycles entirely. ok?   Nobody in i7 artificial intelligence did not put that each time he gave different results for the short code. In addition compression means 40 instructions , and with a loop to 0 cycles.   If you are one of those who read the tea leaves as these instructions can interleaved and what can be 0 or maybe 50 it is deceptive.   I presented quite clear conditions for which there is a stable result.   I do not need those answers I already know them, sorry I've encountered enough people explaining to me how to walk on earth and how to open your mouth, and showing the various documents which I have studied, but the question relates to how this knowledge is used and I really see that the majority is wrong. If you communicate with the developers in principle, it is enough to understand that most of what they write is warnings and is not a dogma. That is all determined by the conditions of the situation which you can capture and study
  2. You talk a lot of theory, but no one tells us what is really going on, because you have about this is only a theoretical idea
  3. I didn't say not a word about magic, I'm merely stating that for any action processor takes time, and if he chooses a certain path calculations, he needs to spend the time or will lose it if there is no correct choice. And above all, if something gives a more efficient algorithm that does not mean that it is not possible to track.
  4. As I have already said it is necessary to perform a multiplication by a constant floating point instead of integer division. What is more quickly
  5. Right, and now look at the code under test, and surprised the next statement (like loop removed), which either do not have a testing example. This may only apply to individual statements and not the entire method.
  6.   I perfectly understand it. But none of these commands cannot be placed in one command block RISC Core i. Except for some mini operations as "inc" or "shr/shl" which is performed in the device forward and can be carried out together with the ordinary operation on a single value. But even on RISC commands such as "inc" or "shr/shl" can't run in the same command block with the same value. 000000013F6C1036 8B C1 mov eax,ecx 000000013F6C1038 D1 E8 shr eax,1 000000013F6C103A FF C0 inc eax 000000013F6C103C 48 C1 E0 04 shl rax,4 000000013F6C1040 48 03 D0 add rdx,rax
  7. phantom no I'm just talking about Intel. Question Elbrus concerns indirectly. Just because the developers claim that I underestimate the time for Intel. And what I'm doing tests on incorrect code that throws and simplifies the compiler. But this is not true.
  8. You understand that only such combinations can be run in parallel: 000000013F6C1040 48 03 D0 add rdx,rax 000000013F6C1043 FF C9 dec ecx 000000013F6C1010 8B 0A mov ecx,dword ptr [rdx] 000000013F6C1012 48 83 C2 04 add rdx,4 in the second case, the value read for the first and second commands from the virtual register edx are executed in parallel, and record the value for the virtual edx may be in another real case. This out-of-order execution ends, and then we go back in the 80's. The maximum that we can derive from this theory that micro operations in one command instead of 5-4 cycles will be executed for 1 each due to additional conveyors.
  9. Or do you think that Maximize Speed (/O2) findVoidSortMap algorithm could be performed for 0 cycles? Do you really think that something similar can be done for 1 cycles? 000000013F6C1036 8B C1 mov eax,ecx 000000013F6C1038 D1 E8 shr eax,1 000000013F6C103A FF C0 inc eax 000000013F6C103C 48 C1 E0 04 shl rax,4 000000013F6C1040 48 03 D0 add rdx,rax
  10. If I had calculated the cycles as cycles and not as factors by which to compare performance, and the question was about why the same opcode different performance all your facts would be appropriate, but we are talking about a completely different algorithms.   You understand that simply refuse to hear me.
  11. These facts are not relevant to the question, and only add confusion to the fact that: Why do people believe in the priority over the correct result (cycles or time) Disabled (/Od) like this:   // test example binary search pointer _NOINLINE findVoidSortMap(void ** list,void *key) { 000000013F1B1000 48 89 54 24 10 mov qword ptr [rsp+10h],rdx 000000013F1B1005 48 89 4C 24 08 mov qword ptr [rsp+8],rcx 000000013F1B100A 48 83 EC 38 sub rsp,38h if (!list) return 0; 000000013F1B100E 48 83 7C 24 40 00 cmp qword ptr [list],0 000000013F1B1014 75 07 jne findVoidSortMap+1Dh (13F1B101Dh) 000000013F1B1016 33 C0 xor eax,eax 000000013F1B1018 E9 DB 00 00 00 jmp findVoidSortMap+0F8h (13F1B10F8h) if (!*list) return 0; 000000013F1B101D 48 8B 44 24 40 mov rax,qword ptr [list] 000000013F1B1022 48 83 38 00 cmp qword ptr [rax],0 000000013F1B1026 75 07 jne findVoidSortMap+2Fh (13F1B102Fh) 000000013F1B1028 33 C0 xor eax,eax 000000013F1B102A E9 C9 00 00 00 jmp findVoidSortMap+0F8h (13F1B10F8h) unsigned int count= **(unsigned int**)list; 000000013F1B102F 48 8B 44 24 40 mov rax,qword ptr [list] 000000013F1B1034 48 8B 00 mov rax,qword ptr [rax] 000000013F1B1037 8B 00 mov eax,dword ptr [rax] 000000013F1B1039 89 44 24 10 mov dword ptr [count],eax char *p=(char*)*list; 000000013F1B103D 48 8B 44 24 40 mov rax,qword ptr [list] 000000013F1B1042 48 8B 00 mov rax,qword ptr [rax] 000000013F1B1045 48 89 04 24 mov qword ptr [rsp],rax p+=4; 000000013F1B1049 48 8B 04 24 mov rax,qword ptr [rsp] 000000013F1B104D 48 83 C0 04 add rax,4 000000013F1B1051 48 89 04 24 mov qword ptr [rsp],rax Element *b=(Element *)p; 000000013F1B1055 48 8B 04 24 mov rax,qword ptr [rsp] 000000013F1B1059 48 89 44 24 18 mov qword ptr [b],rax _voidint skey=(_voidint)key; 000000013F1B105E 48 8B 44 24 48 mov rax,qword ptr [key] 000000013F1B1063 48 89 44 24 08 mov qword ptr [skey],rax while (count>0) { 000000013F1B1068 83 7C 24 10 00 cmp dword ptr [count],0 000000013F1B106D 0F 86 83 00 00 00 jbe findVoidSortMap+0F6h (13F1B10F6h) void** kt=(void**)&b[count>>1]; 000000013F1B1073 8B 44 24 10 mov eax,dword ptr [count] 000000013F1B1077 D1 E8 shr eax,1 000000013F1B1079 8B C0 mov eax,eax 000000013F1B107B 48 6B C0 10 imul rax,rax,10h 000000013F1B107F 48 8B 4C 24 18 mov rcx,qword ptr [b] 000000013F1B1084 48 03 C8 add rcx,rax 000000013F1B1087 48 8B C1 mov rax,rcx 000000013F1B108A 48 89 44 24 28 mov qword ptr [kt],rax _voidint rkey=(_voidint)kt[0]; 000000013F1B108F 48 8B 44 24 28 mov rax,qword ptr [kt] 000000013F1B1094 48 8B 00 mov rax,qword ptr [rax] 000000013F1B1097 48 89 44 24 20 mov qword ptr [rkey],rax if (skey==rkey) return (void**)&kt[1]; 000000013F1B109C 48 8B 44 24 20 mov rax,qword ptr [rkey] 000000013F1B10A1 48 39 44 24 08 cmp qword ptr [skey],rax 000000013F1B10A6 75 0B jne findVoidSortMap+0B3h (13F1B10B3h) 000000013F1B10A8 48 8B 44 24 28 mov rax,qword ptr [kt] 000000013F1B10AD 48 83 C0 08 add rax,8 000000013F1B10B1 EB 45 jmp findVoidSortMap+0F8h (13F1B10F8h) if (skey>rkey) {b+=(count>>1)+1;count--;} 000000013F1B10B3 48 8B 44 24 20 mov rax,qword ptr [rkey] 000000013F1B10B8 48 39 44 24 08 cmp qword ptr [skey],rax 000000013F1B10BD 7E 28 jle findVoidSortMap+0E7h (13F1B10E7h) 000000013F1B10BF 8B 44 24 10 mov eax,dword ptr [count] 000000013F1B10C3 D1 E8 shr eax,1 000000013F1B10C5 FF C0 inc eax 000000013F1B10C7 8B C0 mov eax,eax 000000013F1B10C9 48 6B C0 10 imul rax,rax,10h 000000013F1B10CD 48 8B 4C 24 18 mov rcx,qword ptr [b] 000000013F1B10D2 48 03 C8 add rcx,rax 000000013F1B10D5 48 8B C1 mov rax,rcx 000000013F1B10D8 48 89 44 24 18 mov qword ptr [b],rax 000000013F1B10DD 8B 44 24 10 mov eax,dword ptr [count] 000000013F1B10E1 FF C8 dec eax 000000013F1B10E3 89 44 24 10 mov dword ptr [count],eax count=count>>1; 000000013F1B10E7 8B 44 24 10 mov eax,dword ptr [count] 000000013F1B10EB D1 E8 shr eax,1 000000013F1B10ED 89 44 24 10 mov dword ptr [count],eax } 000000013F1B10F1 E9 72 FF FF FF jmp findVoidSortMap+68h (13F1B1068h) return 0; 000000013F1B10F6 33 C0 xor eax,eax } 000000013F1B10F8 48 83 C4 38 add rsp,38h 000000013F1B10FC C3 ret Which gives 150 in the first embodiment and the second 250.   Excusing the behavior of this algorithm Maximize Speed (/O2): // test example binary search pointer _NOINLINE findVoidSortMap(void ** list,void *key) { 000000013F6C1000 4C 8B CA mov r9,rdx if (!list) return 0; 000000013F6C1003 48 85 C9 test rcx,rcx 000000013F6C1006 74 41 je findVoidSortMap+49h (13F6C1049h) if (!*list) return 0; 000000013F6C1008 48 8B 11 mov rdx,qword ptr [rcx] 000000013F6C100B 48 85 D2 test rdx,rdx 000000013F6C100E 74 39 je findVoidSortMap+49h (13F6C1049h) unsigned int count= **(unsigned int**)list; 000000013F6C1010 8B 0A mov ecx,dword ptr [rdx] char *p=(char*)*list; p+=4; 000000013F6C1012 48 83 C2 04 add rdx,4 Element *b=(Element *)p; _voidint skey=(_voidint)key; while (count>0) { 000000013F6C1016 85 C9 test ecx,ecx 000000013F6C1018 74 2F je findVoidSortMap+49h (13F6C1049h) 000000013F6C101A 66 0F 1F 44 00 00 nop word ptr [rax+rax] void** kt=(void**)&b[count>>1]; 000000013F6C1020 8B C1 mov eax,ecx 000000013F6C1022 48 D1 E8 shr rax,1 000000013F6C1025 48 C1 E0 04 shl rax,4 000000013F6C1029 48 03 C2 add rax,rdx _voidint rkey=(_voidint)kt[0]; 000000013F6C102C 4C 8B 00 mov r8,qword ptr [rax] if (skey==rkey) return (void**)&kt[1]; 000000013F6C102F 4D 3B C8 cmp r9,r8 000000013F6C1032 74 18 je findVoidSortMap+4Ch (13F6C104Ch) if (skey>rkey) {b+=(count>>1)+1;count--;} 000000013F6C1034 7E 0F jle findVoidSortMap+45h (13F6C1045h) 000000013F6C1036 8B C1 mov eax,ecx 000000013F6C1038 D1 E8 shr eax,1 000000013F6C103A FF C0 inc eax 000000013F6C103C 48 C1 E0 04 shl rax,4 000000013F6C1040 48 03 D0 add rdx,rax 000000013F6C1043 FF C9 dec ecx count=count>>1; 000000013F6C1045 D1 E9 shr ecx,1 Element *b=(Element *)p; _voidint skey=(_voidint)key; while (count>0) { 000000013F6C1047 75 D7 jne findVoidSortMap+20h (13F6C1020h) } return 0; 000000013F6C1049 33 C0 xor eax,eax } 000000013F6C104B C3 ret if (skey==rkey) return (void**)&kt[1]; 000000013F6C104C 48 83 C0 08 add rax,8 } 000000013F6C1050 C3 ret Which gives 48, 113. As someone the behavior of the compiler does not give an idea of the work of the full code
  12. I can give another example where the search is performed not one but a number of values to exclude the possibility of prediction and to impede the search. and the result is still less than 150. #ifdef _MSC_VER #include <windows.h> #endif #include <stdio.h> #define max_count_calc 100000 #define _voidint long long #if defined(_M_IX86) && !defined(CPU_X86_64) #define CPU_X86_32 #endif #ifdef _MSC_VER #ifndef _NOINLINE #define _NOINLINE __declspec(noinline) #endif #endif #ifdef __GNUC__ #ifndef __MINGW32__ #ifndef _NOINLINE #define _NOINLINE __attribute__((noinline)) #endif #endif #endif struct Element { void *key; void *val; }; typedef void ** pointer; // test example binary search pointer _NOINLINE findVoidSortMap(void ** list,void *key) { if (!list) return 0; if (!*list) return 0; unsigned int count= **(unsigned int**)list; char *p=(char*)*list; p+=4; Element *b=(Element *)p; _voidint skey=(_voidint)key; while (count>0) { void** kt=(void**)&b[count>>1]; _voidint rkey=(_voidint)kt[0]; if (skey==rkey) return (void**)&kt[1]; if (skey>rkey) {b+=(count>>1)+1;count--;} count=count>>1; } return 0; } #ifdef _MSC_VER #ifdef CPU_X86_32 long long _NOINLINE _TSCDCPU() { _asm rdtsc; } #else long long _NOINLINE _TSCDCPU() { return __rdtsc(); } #endif #endif #ifdef __GNUC__ #ifndef __MINGW32__ long long _NOINLINE _TSCDCPU() { asm ( " rdtsc " ); } #endif #endif typedef void* (*tfSTDCALL_p_FUNC_p_p)(void*,void*); tfSTDCALL_p_FUNC_p_p MPFind=(tfSTDCALL_p_FUNC_p_p)&findVoidSortMap; int main(int argc, char *argv[]) { // allok array void *map1=(void *)new char[sizeof(int)+sizeof(void*)*2*1024]; char *p=(char *)map1; // to set the number of elements equal to 1024 int *c=(int *)p; *c=1024; // fill in the elements of array key:values 0:0 10:1 20:2 30:3.. p+=sizeof(int); for (int i=0;i<1024;i++) { void**key=(void**)p; *key=(void*)(i*10); p+=sizeof(void*); void**val=(void**)p; *val=(void*)i; p+=sizeof(void*); } // values for finding void **rezvalues=(void **)new char[sizeof(void*)*1024]; void **findvalues=(void **)new char[sizeof(void*)*1024]; for (int i=0;i<1024;i++) { findvalues[i]=(void*)(i*10); } void **rez; for (int j=0;j<10;j++) { int i=max_count_calc; LARGE_INTEGER s1; LARGE_INTEGER s2; void** rezval=rezvalues; void** val=findvalues; void** max=findvalues; max+=1024; QueryPerformanceCounter(&s1); // to run the test 100000 times while (i>0) { *rezval=(void **)MPFind(&map1,(void*)*val); i--; val++; rezval++; if (val>=max) { val=findvalues; rezval=rezvalues; } } QueryPerformanceCounter(&s2); // to calculate the average result for a single test s1.QuadPart=(s2.QuadPart-s1.QuadPart); double s3=s1.QuadPart*(1000.0/max_count_calc); printf("%lf \n",s3); } printf( "%i", **(int **)rezvalues+1 ); scanf( "%d", &rez ); delete[] (char*)rezvalues; delete[] (char*)findvalues; delete[] (char*)map1; return 0; } #ifdef _MSC_VER int __stdcall WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, INT) { return main(0,0); } #endif 120.220000 116.510000 113.500000 115.470000 115.960000 119.300000 119.410000 119.130000 119.450000 119.420000   But you need to take into account the fact that there is a reference to the data array   and if I will do a test with Disabled (/Od) the result will be: 249.760000 253.250000 249.150000 250.800000 247.830000 254.040000 252.100000 248.670000 249.330000 239.590000 It is an objective result?
  13.   Don't fully understand the written. It is not about how you can run the code in "C" how many cycles and how to get the most effective opcodes. With full functionality. And it's not that the code can be effectively optimized by the compiler. But the fact that people confuse these events with the real situation.     Such statements can say that you have no idea what exactly is going on in the processor and may occur during its operation on a single core. The processor Core i have no possibility to parallelize the linear code, the results of each instruction which is dependent on previous instructions. And a tech is not able to share this code between processor cores. He can't see the future without calculations. Or the new processor is able to guess the cards? :)   It's all very indirectly relates to the question posed, and can see to justify a wrong answer to the wrong question.   Yes we can use inline and in fact the compiler is able to deploy a fully functional method. And the result can be 40 cysles. Or to boil it down to one simple constant calculation result and we get 1 cysles or 0. But to assert that the correct result at 150 cysles. Given that there are a large number of memory accesses. While the result for 48 cycles shows us the difference when running the same algorithm in registers and that it is not correct - this is misleading. Misleading to say that the result in this particular example, giving 48 cycles is the result of optimizatsii compiler to constant calculation. Impossible to talk about the fact that the sample obtained with the remove loop and gives the result to 4800000 cycles
  14. This loop in Release x64 Maximize Speed (/O2) 000000013F2910D0 48 8D 4C 24 60 lea rcx,[s1] 000000013F2910D5 BB A0 86 01 00 mov ebx,186A0h 000000013F2910DA FF 15 20 0F 00 00 call qword ptr [__imp_QueryPerformanceCounter (13F292000h)] // to run the test 100000 times while (i>0) { rez=(void **)MPFind(&map1,(void*)0); 000000013F2910E0 48 8D 4C 24 68 lea rcx,[map1] 000000013F2910E5 33 D2 xor edx,edx 000000013F2910E7 FF 15 13 1F 00 00 call qword ptr [MPFind (13F293000h)] i--; 000000013F2910ED FF CB dec ebx 000000013F2910EF 48 89 44 24 28 mov qword ptr [rez],rax 000000013F2910F4 85 DB test ebx,ebx 000000013F2910F6 7F E8 jg main+70h (13F2910E0h) } QueryPerformanceCounter(&s2); 000000013F2910F8 48 8D 4C 24 20 lea rcx,[s2] 000000013F2910FD FF 15 FD 0E 00 00 call qword ptr [__imp_QueryPerformanceCounter (13F292000h)]   with a result: 47.070000 46.920000 46.630000 47.230000 46.080000 45.990000 46.500000 46.510000 46.800000 46.720000 And this remove loop in Release x64 Maximize Speed (/O2) QueryPerformanceCounter(&s1); 000000013F7110C3 48 8D 4C 24 60 lea rcx,[s1] 000000013F7110C8 FF 15 32 0F 00 00 call qword ptr [__imp_QueryPerformanceCounter (13F712000h)] // to run the test 100000 times while (i>0) { rez=&map1;//(void **)MPFind(&map1,(void*)0); 000000013F7110CE 4C 8D 5C 24 68 lea r11,[map1] i--; } QueryPerformanceCounter(&s2); 000000013F7110D3 48 8D 4C 24 20 lea rcx,[s2] 000000013F7110D8 4C 89 5C 24 28 mov qword ptr [rez],r11 000000013F7110DD FF 15 1D 0F 00 00 call qword ptr [__imp_QueryPerformanceCounter (13F712000h)] with a result: 0.000000 0.000000 0.000000 0.001000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  15. Now I used the QueryPerformanceCounter, and the result is still 48. Now what's wrong? #ifdef _MSC_VER #include <windows.h> #endif #include <stdio.h> #define max_count_calc 100000 #define _voidint long long #if defined(_M_IX86) && !defined(CPU_X86_64) #define CPU_X86_32 #endif #ifdef _MSC_VER #ifndef _NOINLINE #define _NOINLINE __declspec(noinline) #endif #endif #ifdef __GNUC__ #ifndef __MINGW32__ #ifndef _NOINLINE #define _NOINLINE __attribute__((noinline)) #endif #endif #endif struct Element { void *key; void *val; }; typedef void ** pointer; // test example binary search pointer _NOINLINE findVoidSortMap(void ** list,void *key) { if (!list) return 0; if (!*list) return 0; unsigned int count= **(unsigned int**)list; char *p=(char*)*list; p+=4; Element *b=(Element *)p; _voidint skey=(_voidint)key; while (count>0) { void** kt=(void**)&b[count>>1]; _voidint rkey=(_voidint)kt[0]; if (skey==rkey) return (void**)&kt[1]; if (skey>rkey) {b+=(count>>1)+1;count--;} count=count>>1; } return 0; } #ifdef _MSC_VER #ifdef CPU_X86_32 long long _NOINLINE _TSCDCPU() { _asm rdtsc; } #else long long _NOINLINE _TSCDCPU() { return __rdtsc(); } #endif #endif #ifdef __GNUC__ #ifndef __MINGW32__ long long _NOINLINE _TSCDCPU() { asm ( " rdtsc " ); } #endif #endif typedef void* (*tfSTDCALL_p_FUNC_p_p)(void*,void*); tfSTDCALL_p_FUNC_p_p MPFind=(tfSTDCALL_p_FUNC_p_p)&findVoidSortMap; int main(int argc, char *argv[]) { // allok array void *map1=(void *)new char[sizeof(int)+sizeof(void*)*2*1024]; char *p=(char *)map1; // to set the number of elements equal to 1024 int *c=(int *)p; *c=1024; // fill in the elements of array key:values 0:0 10:1 20:2 30:3.. p+=sizeof(int); for (int i=0;i<1024;i++) { void**key=(void**)p; *key=(void*)(i*10); p+=sizeof(void*); void**val=(void**)p; *val=(void*)i; p+=sizeof(void*); } void **rez; for (int j=0;j<10;j++) { int i=max_count_calc; LARGE_INTEGER s1; LARGE_INTEGER s2; QueryPerformanceCounter(&s1); // to run the test 100000 times while (i>0) { rez=(void **)MPFind(&map1,(void*)0); i--; } QueryPerformanceCounter(&s2); // to calculate the average result for a single test s1.QuadPart=(s2.QuadPart-s1.QuadPart); double s3=s1.QuadPart*(1000.0/max_count_calc); printf("%lf \n",s3); } scanf( "%d", &rez ); return 0; } #ifdef _MSC_VER int __stdcall WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, INT) { return main(0,0); } #endif ?gain /* ... loop removed ... */ ? You understand that this is a stable result: 46.550000 47.270000 48.050000 47.340000 46.330000 48.400000 47.120000 46.410000 47.360000 47.500000 And this does not mean that influence the results of parallel threads and processes.Like this: 0.000000 0.000000 0.000000 0.000000 47.500000 0.000000 0.000000 0.000000 0.000000 0.000000
  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!