*ends up skipping most of the thread*
Honestly the article seems to be pointing out a single problem that I'm tired of seeing: everything gets allocated on the heap. Depending on the language, even simple integers end up there. This alone can bring the CPU to its knees: constant allocating and deallocating, cache locality getting completely destroyed, the GC going crazy having to clean up all the time, etc.
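To make that concrete, here's a toy C sketch (mine, not from the article) contrasting a flat array of ints with one heap allocation per integer, which is roughly what boxed integers look like under the hood:

```c
/* Toy illustration: summing a million ints stored contiguously vs. each
 * behind its own heap allocation. The boxed version pays a pointer
 * dereference per element, so every access can miss cache. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    /* Unboxed: one contiguous block, cache-friendly. */
    int *flat = malloc(N * sizeof *flat);

    /* "Boxed": every integer is its own heap allocation, which is what
     * many dynamic languages do under the hood. */
    int **boxed = malloc(N * sizeof *boxed);
    for (int i = 0; i < N; i++) {
        flat[i] = i;
        boxed[i] = malloc(sizeof **boxed);
        *boxed[i] = i;
    }

    long sum_flat = 0, sum_boxed = 0;
    for (int i = 0; i < N; i++) sum_flat += flat[i];    /* sequential reads  */
    for (int i = 0; i < N; i++) sum_boxed += *boxed[i]; /* deref per element */
    printf("%ld %ld\n", sum_flat, sum_boxed);

    /* Caveat: in this toy the allocator may hand back nearly contiguous
     * boxes; in a long-running program they end up scattered, which is
     * where the cache misses really come from. */
    for (int i = 0; i < N; i++) free(boxed[i]);
    free(boxed);
    free(flat);
    return 0;
}
```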
Also, wow, that Haswell chart. I did a quick calculation: a program that's constantly causing cache misses would easily be brought down to the equivalent of sub-100MHz speeds. If every access goes to main memory at roughly 230 cycles a pop, a 3GHz Haswell is effectively running at 3GHz / 230 ≈ 13MHz (though I imagine the code being executed at least stays in the instruction cache =P). No wonder so many modern programs feel slow despite the fact that computers should be much faster. This is probably a bullshit calculation, honestly, but it would explain a lot.
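If you'd rather measure than estimate, a rough pointer-chasing microbenchmark (my sketch, not from the article) exposes the per-miss latency directly; clock speed divided by cycles-per-hop gives that "effective MHz" figure:

```c
/* Chase a randomly shuffled cycle of indices so each load depends on the
 * previous one and (with a working set far bigger than L3) mostly misses
 * cache. Reports nanoseconds per hop. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)               /* 16M entries * 8 bytes = 128 MB */
#define HOPS (10 * 1000 * 1000)

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;

    /* Sattolo's algorithm: shuffle into a single big cycle. */
    srand(12345);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i guarantees one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < HOPS; i++) p = next[p];  /* serialized loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per hop (p=%zu)\n", ns / HOPS, p); /* p defeats DCE */
    free(next);
    return 0;
}
```

If a hop comes out around 75ns on a 3GHz part, that's ~225 cycles per access, right in the ballpark of the 13MHz figure above.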
Moreover, the compiler is usually better at asm than us, so it takes our intentions and cleverly optimizes them better than we ever could, using its superhuman knowledge of instruction scheduling, dependency chain tracking, register juggling, etc...
Yes.
I once wrote some simple matrix multiplication code in the laziest way possible (plain nested for loops) to keep the code as simple as I could. I decided to pass it through GCC with optimizations at maximum to see what it would do. Cue the entire thing being turned into an unrolled loop full of SIMD instructions, with the two matrices crammed entirely into registers. Then I decided to write transformation functions, passing the matrices as-is instead of trying to optimize them by hand (even though I knew most of the calculations would be redundant, given all the 0s and 1s). GCC completely inlined the matrix multiplication function, then optimized the inlined code taking the constant values into account: basically doing the very thing I had refused to do in the source code.
Now, that was C, but the point stands: don't underestimate the compiler. Just make sure the code is simple enough that the compiler can recognize the idiom. This is the code, if somebody wonders.
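For anyone who doesn't feel like digging through it, the naive version is roughly this shape (a from-memory sketch, assuming 4x4 float matrices, not the verbatim code):

```c
#include <stdio.h>

typedef struct { float m[4][4]; } Mat4;

/* Naive 4x4 multiply: no tricks, just the textbook nested loops. */
static Mat4 mat4_mul(const Mat4 *a, const Mat4 *b) {
    Mat4 r;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a->m[i][k] * b->m[k][j];
            r.m[i][j] = s;
        }
    return r;
}

int main(void) {
    /* A translation matrix times a scale matrix: lots of 0s and 1s, so
     * after inlining the compiler can fold most of the multiplies away. */
    Mat4 t = {{{1,0,0,5},{0,1,0,6},{0,0,1,7},{0,0,0,1}}};
    Mat4 s = {{{2,0,0,0},{0,2,0,0},{0,0,2,0},{0,0,0,1}}};
    Mat4 r = mat4_mul(&t, &s);
    printf("%g %g %g %g\n", r.m[0][0], r.m[0][3], r.m[1][1], r.m[3][3]);
    return 0;
}
```

On x86-64, GCC at -O3 typically unrolls and vectorizes the whole thing, and once mat4_mul is inlined at a call site with constant matrices, the multiplies by 0 and 1 get folded away.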