Out-of-order execution

Started by
3 comments, last by Adam_42 13 years, 4 months ago
I'm curious as to why I see assembly code like this:

mov eax,DWORD PTR[0x3434] ; <--- load value from memory into the register
add eax,eax               ; <--- add the value to itself and store the result in eax

Isn't this inefficient? The CPU will either need to stall to cover the latency of loading the value from memory into the register, or the add would read a value that isn't in the register yet. Anyway, would it be more efficient to write this code out of order? For example:


mov eax,DWORD PTR[0x3434] ; <--- load value from memory into the register
... (other, independent instructions) ...
add eax,eax               ; <--- add the value to itself and store the result in eax

Wouldn't this be more efficient? By the time you get to the add instruction, the value will already be loaded into the register and good to go. -thx
It entirely depends on the processor.
A regular desktop processor does Out-Of-Order instruction re-scheduling. It has a large queue of instructions that it can schedule at any one time. If it can move them around, it will try to remove as many stalls as it can.
Something like a netbook Atom processor is in-order, and won't reschedule the instructions, resulting in a stall.
Then there are other technologies, like hyperthreading. Your thread may stall at the mov, but the other active thread on the core may have instructions that can run in the meantime.
Quote:Original post by nuclear123
would this cause the more efficient use?
It's a superset of an NP-complete problem.

Compilers try to optimize this for small cases, but no general optimal solution is known. Even simpler subsets of the problem are very hard to solve.
As explained above, OOO execution is managed by the hardware, not the programmer.

Given that you're programming in a higher-level language, any standard, well-established compiler will do software ILP (instruction re-scheduling), taking any potential dependencies/hazards into account.

If you don't write in a higher-level language but actually hand-code in assembly language, then what you suggested may improve pipeline throughput, but keep in mind that you would have to be highly aware of the instructions you interleave as well, taking into account the various stall penalties between different types of instructions.
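As a rough illustration of what such manual scheduling looks like (a sketch only - the filler instructions and the address are made up, and the actual benefit depends on the microarchitecture):

```asm
mov  eax, DWORD PTR [0x3434] ; start the load as early as possible
mov  ecx, ebx                ; independent work that doesn't touch eax...
shl  ecx, 2                  ; ...helps fill the load-latency slots
add  eax, eax                ; by now the load has (hopefully) completed
```

On an in-order core this can hide part of the load latency; on an out-of-order core the hardware will often find an equivalent schedule by itself.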
It's also worth noting that it's generally impossible to schedule instructions far enough apart to hide the stall you get from actually reading main memory - it can be hundreds of clock cycles. To avoid those stalls you need prefetching, either manually with prefetch instructions or automatically via the CPU's built-in prefetch logic (if it has any). Instruction schedulers generally assume that data is already in the CPU's L1 cache.
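For example, a sketch of manual prefetching in a loop summing a large array (the register usage and the prefetch distance of 256 bytes are illustrative guesses - the right distance has to be tuned per CPU):

```asm
loop_top:
    prefetcht0 [esi+256]      ; hint: start pulling in a cache line a few lines ahead
    mov  eax, DWORD PTR [esi] ; this load now (ideally) hits the L1 cache
    add  edx, eax             ; accumulate into edx
    add  esi, 4               ; advance to the next dword
    cmp  esi, edi             ; edi holds the end address
    jb   loop_top
```

prefetcht0 is only a hint; the CPU is free to ignore it, and prefetching too far (or not far enough) ahead wastes the effort.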

This topic is closed to new replies.
