Why does this benchmark behave this way?

Started by
28 comments, last by swiftcoder 9 years, 3 months ago

phantom

No, I'm just talking about Intel. The question concerns Elbrus only indirectly. It's just that the developers claim I underestimate the time for Intel, and that I'm running my tests on incorrect code that the compiler throws away and simplifies. But this is not true.


Remember: the assembly you see is NOT what the CPU executes in the ALU units. While the assembly might look like CISC, under the hood Intel CPUs have been converting it to RISC-like micro-ops for some time now.

I understand that perfectly. But none of these instructions can be placed into a single RISC command block on a Core i. Except for some mini-operations such as "inc" or "shr"/"shl", which are handled in the forwarding unit and can be carried out together with an ordinary operation on a single value. But even on a RISC, instructions such as "inc" or "shr"/"shl" can't run in the same command block operating on the same value.


000000013F6C1036 8B C1                mov         eax,ecx  
000000013F6C1038 D1 E8                shr         eax,1  
000000013F6C103A FF C0                inc         eax  
000000013F6C103C 48 C1 E0 04          shl         rax,4  
000000013F6C1040 48 03 D0             add         rdx,rax
Depending on the core wiring, and I have no direct knowledge of that, there is no real reason why a shift-right-one-and-add-1 couldn't be a microcoded op, which would reduce the operational count from the 4 distinct x86 instructions to a single instruction in the core (source from 'ecx' -> shift-right-and-inc -> sink to 'rax'). At which point, yes, the 'shl' and 'add' instructions are likely too general-case to be done that way (although shl-4 might be an accelerated path with a register-sourced add, depending on how common it is), but even so your 5 x86 ops could be reduced to 3 internal ops, maybe even 2.

Depending on the surrounding code other bits might be interleaved with that operation too (such as the source for rdx and any following code).

Right, and now look at the code under test, and be surprised by the next claim (that the loop was removed), which doesn't even apply to the test example.

The results were amazing. Many instructions could potentially be done in parallel, bringing the cost down to an effective zero cycles.

This may only apply to individual statements, not the entire method.

Even though you have some horrible things in your code, like turning your 64-bit values into floating point numbers and then converting them back

As I have already said, you have to perform a multiplication by a floating-point constant instead of an integer division, because it is faster.

No magic needed.

I didn't say a word about magic. I'm merely stating that any action takes the processor time, and if it chooses a certain computation path it has to spend that time, or it loses time if the choice turns out to be wrong.
And above all, if something yields a more efficient algorithm, that does not mean it is impossible to measure.

You talk a lot of theory, but no one tells us what is really going on, because you have only a theoretical idea about this.

You write a lot, but it reads like you used Google Translate and is nearly impossible to understand. Then people answer and you don't understand them. You would be better off asking people you can actually communicate with in Russian.

Or go directly to the source and look for technical manuals on Intel or AMD page.


but no one tells us what is really going on, because you have only a theoretical idea about this

If you want to know what is going on, read this. Read it VERY carefully. It will explain all the details.

Give special attention to section 2.2, which gives a good technical overview of the process. Continue reading if you want more detail, that provides about all the technical details that matter.

On the current 4th generation i7 machines each core can do:

* up to 10 instructions can be decoded on a single cycle on an HT core

* up to 12 ALU micro-ops every cycle

* up to 8 FPU operations every cycle

Note in your disassembly that your compiler (which is smarter than you) chose to include some operations that use the FPU so it could take advantage of all that other speed.

Many instructions take 0 clock ticks to execute. They are free.

Many instructions get interleaved with their surrounding instructions. When one slow instruction stalls retirement, surrounding operations take effectively 0 cycles to execute. They are also free.

As for predicting the future, yeah, they can do that too. Branch prediction means that when you have conditional operations in your code, the processor will predict which branch will be taken and, when unsure, start both branches before you reach them. The branch predictor's loop tracking means that in a tight loop the CPU can remember what it did last time, making the entire loop effectively free. Speculative execution means that when one of the core's internal ports has nothing else to do, it will search for upcoming work that doesn't have any pending dependencies and start doing that work.

So what does all that mean?

* if you have a CPU instruction that takes quite some time to execute, all the instructions around it require zero processor cycles.

* If a small number of instructions stall waiting for memory, you can potentially get over 100 other instructions all done with zero processor cycles.

* The processor can remember tight loops (like your code) so it doesn't even need to decode them again; the processor can just skip ahead to the answer.

There is almost zero correlation between a single instruction and the number of CPU cycles it takes to execute. How long an instruction takes depends almost entirely on context, on what else is going on around it.

Using the CPU counter can work for counting time. Many instructions require zero time.

What I mean is that there is clearly something wrong in your reasoning, but what exactly is going on I do not know.

Many instructions take 0 clock ticks to execute. They are free.

Many instructions get interleaved with their surrounding instructions. When one slow instruction stalls retirement, surrounding operations take effectively 0 cycles to execute. They are also free.

I'm not talking about a single instruction but about the whole algorithm, and it is quite possible to count the number of cycles for it as a whole. OK?

Nobody built an artificial intelligence into the i7 that would give different results each time for a short piece of code. Besides, that would mean compressing 40 instructions, together with a loop, down to 0 cycles.

If you are one of those who reads tea leaves, saying these instructions can be interleaved and might take 0 cycles or maybe 50, that is deceptive.

I presented quite clear conditions under which there is a stable result.

You write a lot, but it reads like you used Google Translate and is nearly impossible to understand. Then people answer and you don't understand them. You would be better off asking people you can actually communicate with in Russian.

Or go directly to the source and look for technical manuals on Intel or AMD page.

I do not need those answers; I already know them, sorry.

I've met enough people who explain to me how to walk on the earth and how to open my mouth, and who point me at various documents which I have already studied; but the question is about how this knowledge is used, and I really do see that the majority are wrong.

If you communicate with the developers at all, it is enough to understand that most of what they write consists of warnings, not dogma. Everything is determined by the conditions of the situation, which you can capture and study.

This topic is closed to new replies.
