Sorry, I don't have many links to documents that would help you understand the driver/hardware behaviour, but it would be interesting to hear your approach to isolating these steps.
In my master's thesis I measured the timings of the different steps in my pipeline. Two of the six steps depend on the geometry of the input model; the rest are fixed (e.g. rendering a background scene, compositing the targets into the final output, ...).
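One way to separate the geometry-dependent steps from the fixed ones is to time the whole pipeline at several input complexities and fit a line: the intercept estimates the fixed cost, the slope the per-primitive cost. A minimal sketch with made-up measurements (real data would come from GPU timer queries, not CPU timing):

```python
# Sketch: split frame time into fixed and geometry-dependent parts.
# Hypothetical measurements: (triangle_count, frame_time_ms) pairs.
samples = [(10_000, 4.1), (20_000, 5.2), (40_000, 7.4), (80_000, 11.8)]

n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n

# Ordinary least-squares fit: time ~= intercept + slope * triangles.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
         / sum((x - mean_x) ** 2 for x, _ in samples))
intercept = mean_y - slope * mean_x

print(f"fixed cost ~ {intercept:.2f} ms")                # geometry-independent steps
print(f"per-triangle cost ~ {slope * 1000:.2f} ms/ktri") # geometry-dependent steps
```

With the numbers above this yields ~3.0 ms of fixed cost and ~0.11 ms per 1000 triangles; in practice you'd vary only one step's workload at a time to attribute the slope to it.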
Typically it's hard to measure any single operation -- e.g. composing targets executes a cheap shader (low chance of ALU being the bottleneck), but puts a lot of pressure on texture fetching and an equal amount on ROP, making it hard to tell which of the two is the bottleneck.
Also, even within a single generation of GPUs, bandwidth can differ by ~20x between low-end and high-end models, meaning an operation that's ALU-bound on one model may be fetch-bound on another.
There's a careful balance between fetch latencies (and the GPU's bandwidth), the number of temp registers a shader requires (and the size of the GPU's register file), and the amount of non-fetch (ALU) work (and the GPU's ALU throughput).
e.g. if a shader is fetch-bound, then decreasing temp-reg usage (or increasing the GPU's reg-file size) might remove that bottleneck. In the same situation, you may be able to add more ALU work "for free" (assuming you don't add more temp registers as well), or you might find that moving to a GPU with a lower ALU speed doesn't affect performance.
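The register/latency interaction above can be sketched with a toy latency-hiding model: fewer temp registers per thread means more threads resident, and more resident threads means more ALU work available to cover each fetch's latency. Every number here is made up:

```python
# Toy latency-hiding model; all hardware numbers are hypothetical.
REG_FILE = 1024      # registers available per core
FETCH_LATENCY = 400  # cycles until a texture fetch returns

def fetch_hidden(temp_regs, alu_cycles_between_fetches):
    """A fetch's latency is hidden if the other resident threads
    together have enough ALU work to fill the wait."""
    threads_in_flight = REG_FILE // temp_regs
    available_work = (threads_in_flight - 1) * alu_cycles_between_fetches
    return available_work >= FETCH_LATENCY

# A shader using 64 temp regs with 20 ALU cycles between fetches stalls:
print(fetch_hidden(64, 20))  # 15 * 20 = 300 < 400  -> False
# Halving register usage doubles the threads in flight and hides the fetch:
print(fetch_hidden(32, 20))  # 31 * 20 = 620 >= 400 -> True
```

This is also why adding ALU work to a fetch-bound shader can be "free": raising `alu_cycles_between_fetches` only helps hide the latency you're already paying for, as long as it doesn't raise `temp_regs` too.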
What do you mean exactly? The timings change run-to-run, as in they're random? Or they've changed from the previous GPU? Or they change based on some API state that's set?
On another GPU (AMD..something), all the timings change, more or less slightly.