Ravyne, your entire post was super insightful! That number 1 is definitely the key; I figure if you can do that part right, then optimization should become less and less of an issue. For me, I'd rehash the thing till it's blazing fast, and after a while I suppose, if you're good enough at that, you can almost fit all of those steps into just number one from the start. My C code can be unruly at times because I like to avoid using too many pointers, but that's a whole different topic. I suppose it goes back to the performance thing being essential.
I'm glad you found it useful, but I'm worried you've misinterpreted it a bit -- I want to drive home the point that it is in no way meant to be viewed as something you can compress into a single step, nor should you want to. It's a process that aims to put your efforts squarely where they belong, while leaving code at the highest level of abstraction that still meets its performance requirements. Trying to compress it all into one step is impossible without making dangerous assumptions that will end up costing you time, maintainability, and performance. An expert might skip steps 6 and 7 if, and only if, they are certain through wisdom and experience that the compiler cannot or will not generate the best code -- but they will never simply assume that to be the case, nor would they skip the earlier steps without first having hard data to prove that the code in question falls into a performance hot-spot.
I'll share something I learned just yesterday, which is tangential but illustrates why the thing you think will work best often doesn't.
I attended a meeting of the Northwest C++ Users Group last night. The topic was Visual Studio's Profile-Guided Optimization (PoGO, for short) feature and how it works. In brief, PoGO works by instrumenting a build of your application, which you then run through various performance-sensitive scenarios in order to train it with real data. Then, you do the real (release) build in a way that incorporates that training to help inform the compiler how to generate the best code for real use-cases. For example, based on whether a conditional is likely to be true or false, it might swap the order of the branches in an 'if' statement so that the processor speculatively executes the correct branch in the majority of cases. When that happens, the CPU stalls less and performance increases. It also applies what it has learned about how often, and from where, every function call is made (this influences whether the function should be inlined or not). It's all very complex, and I'm simplifying it here, but that's what it does in a nutshell.
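To make the branch-ordering idea concrete, here is a rough sketch of the hand-written equivalent: telling the compiler yourself which way a branch usually goes. This is not how PoGO is implemented, just an illustration of the kind of decision it makes from training data. The LIKELY/UNLIKELY macros and the sum_valid function are hypothetical names I made up for this example, and __builtin_expect is a GCC/Clang extension, so on other compilers the macros fall back to the plain condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hint macros. __builtin_expect is a GCC/Clang extension;
   elsewhere the macros degrade to the unadorned condition. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Sums the non-negative entries of values[0..count) and counts the
   negative ones as errors. The hint is a guess that errors are rare;
   it changes code layout at most, never the result. */
static long sum_valid(const int *values, size_t count, size_t *errors)
{
    long total = 0;

    assert(errors != NULL);
    *errors = 0;

    for (size_t i = 0; i < count; ++i) {
        if (UNLIKELY(values[i] < 0)) {
            ++*errors;          /* assumed cold path */
        } else {
            total += values[i]; /* assumed hot path */
        }
    }
    return total;
}
```

The catch is that the hint is only as good as my guess. If real inputs turn out to be mostly negative, I've told the compiler exactly the wrong thing, whereas a profile-trained build would have measured it.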
When we got to the end of the presentation and the speaker was comparing results of PoGO-compiled code vs code compiled with -O2 (the highest level of non-PoGO optimization Visual Studio's compiler supports), there were some really interesting results. Not only was the PoGO-compiled code faster, it was also smaller, and it inlined only about 5% of the overall call sites, vs around 20% or higher for the -O2-compiled code. Now, it performs a number of other optimizations to achieve that, but think about those stats on their own -- the PoGO code used far less inlining than the -O2 code, and was both faster and smaller at the same time. Best performance is not achieved by being most aggressive with potential optimizations; it's achieved by being really smart about where optimizations are applied, and by applying them in the context of real data about real scenarios.
Let's think about that another way: For all of the thousands of PhD-hours and hundreds of millions of dollars thrown into compiler research over decades, not even the compiler (and one of the best in the world, at that) can generate its best code without first profiling the program!
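In the same spirit, here is a minimal sketch of getting a number before touching anything. It assumes a made-up workload (checksum) standing in for whatever you suspect is slow, and it uses the standard clock() function, which measures CPU time at fairly coarse resolution. A real profiler will tell you far more; this is only the cheapest possible first measurement, and both function names are hypothetical.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical workload standing in for a suspected hot spot. */
static unsigned long checksum(const unsigned char *data, size_t len)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < len; ++i) {
        sum = sum * 31u + data[i];
    }
    return sum;
}

/* Calls checksum() `repeats` times and returns the elapsed CPU time in
   seconds. The accumulated result is written to *result so the compiler
   cannot discard the work being timed. */
static double time_checksum(const unsigned char *data, size_t len,
                            int repeats, unsigned long *result)
{
    unsigned long acc = 0;
    clock_t start = clock();

    for (int r = 0; r < repeats; ++r) {
        acc += checksum(data, len);
    }

    clock_t end = clock();
    *result = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call it with a realistic buffer and a repeat count high enough that the elapsed time is well above the clock's resolution, then compare before and after a change. If the number doesn't move, the change wasn't worth the lost readability.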