What do you mean lower in the cache?
I just mean lower -- the ALU/FPU only takes from registers, and continuing up the hierarchy, the registers only take from L1, L1 only takes from L2, L2 from L3, and L3 from system RAM (assuming no L4, which is rare and typically a victim cache anyway). If the data you want was previously evicted from L1 but still exists in L2, then you pay L2 latency on the first access and L1 latency on subsequent accesses of that cache line. What I meant is that you don't get the lowest latency unless you consistently work out of L1. And since L1 is very small, competition for cache lines is very high.
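If you want to see that first-touch penalty, here's a rough sketch -- it assumes POSIX clock_gettime, and the eviction trick and timings are best-effort (a serious measurement would pin the thread and read performance counters), but the second pass should come out visibly faster:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    enum { BUF = 4096, EVICT = 32 * 1024 * 1024 };

    static long long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void) {
        char *buf = malloc(BUF), *evict = malloc(EVICT);
        volatile long sink = 0;
        memset(buf, 1, BUF);
        memset(evict, 2, EVICT);  /* sweep a big array to push buf out of L1/L2 */

        long long t0 = now_ns();
        for (int i = 0; i < BUF; i++) sink += buf[i];  /* cold-ish: lines refill from lower levels */
        long long t1 = now_ns();
        for (int i = 0; i < BUF; i++) sink += buf[i];  /* warm: the same lines now sit in L1 */
        long long t2 = now_ns();

        printf("first pass %lld ns, second pass %lld ns\n", t1 - t0, t2 - t1);
        free(buf); free(evict);
        return 0;
    }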
Latency matters for things in L1 as well, since L1 latency is around, what, 7 cycles now?
Yes, something like that. But as I said, competition is very high and you're likely to find yourself evicted if you lollygag. Furthermore, if you aren't making efficient use of your cache line, then you need to load another one sooner, which evicts some other cache line whose owner may not be done with it, forcing it to reload from L2 next time around -- all of which adds to the competition problem. Keep in mind your own app may be running several threads on a physical core, not to mention all the OS threads and other apps.
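To make 'efficient use of your cache line' concrete, here's a sketch contrasting two layouts (the struct and field names are just illustrative). With 64-byte lines, the first loop drags a whole line through the cache for every element it touches, while the second gets 16 useful floats out of each line:

    #include <stddef.h>

    /* Array-of-structs: reading only `hot` still pulls the 60 cold
       bytes alongside it, so each 64-byte line feeds one iteration. */
    struct entity_aos {
        float hot;        /* the field the loop actually uses    */
        char  cold[60];   /* everything else, along for the ride */
    };

    /* Struct-of-arrays: hot fields packed contiguously, so one
       64-byte line feeds 16 iterations. */
    struct entities_soa {
        float *hot;
        char  *cold;      /* cold data lives elsewhere */
    };

    float sum_aos(const struct entity_aos *e, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += e[i].hot;   /* one line per element */
        return s;
    }

    float sum_soa(const struct entities_soa *e, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += e->hot[i];  /* 16 elements per line */
        return s;
    }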
Each level of the memory hierarchy has an associated latency, and those latencies never change -- only the level of the hierarchy at which your data is found determines its apparent latency. You don't have any control over which level it's found in, either; you only know that if you're reading it now, it's in L1, and that it probably won't be there long. The only influence you can exert, then, is to be ready to use the cache line so you can retire it as quickly as possible before the competition gets it evicted -- you want to do as much work as possible with every cache line, and that means the cache line needs to be filled with useful data.
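One concrete way to get more work out of every line is loop fusion -- a sketch, assuming the two operations can legally be combined. For arrays larger than the cache, the two-pass version loads every line twice; the fused version retires each line in a single visit:

    #include <stddef.h>

    /* Two passes: each line is loaded for the scale, likely evicted,
       then loaded again for the bias -- twice the memory traffic. */
    void two_pass(float *a, size_t n, float scale, float bias) {
        for (size_t i = 0; i < n; i++) a[i] *= scale;
        for (size_t i = 0; i < n; i++) a[i] += bias;
    }

    /* Fused: every line is touched once and fully used while it
       is still resident. */
    void fused(float *a, size_t n, float scale, float bias) {
        for (size_t i = 0; i < n; i++) a[i] = a[i] * scale + bias;
    }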
Out-of-order execution with large instruction windows, plus speculative execution, can make accessing L1 feel almost like accessing a register. (Not exactly, but you see what I'm getting at.)
Yes, of course, closer is better. Fancy modern processors make L1 retrieval so fast that it's almost free -- it only takes a few independent instructions to mask it. But again, you've got to keep the data in L1 for that to hold, and that's hard to do. Luckily, what's good for L3 cache is also good for L1 cache, so you don't have to think much about the differences (only really when you start talking about the size of your active data set, since obviously 256 KB of 'hot' data isn't going to fit in L1) -- for smaller working sets that fit inside L1 (or especially one-and-done cache-line accesses) it's enough to just worry about what's good for L1 cacheability. Regardless of the cache level, though, the principles of well-predicted prefetches and dense cache lines hold true.
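And 'a few independent instructions to mask it' just means breaking the dependency chain. A sketch with four accumulators (the unroll factor is a guess -- the right number depends on the load latency and the microarchitecture; note it also reassociates the floating-point sum):

    #include <stddef.h>

    /* One accumulator: every add waits on the previous one, so the
       L1 load latency sits on the critical path. */
    float sum1(const float *a, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Four accumulators: four independent chains keep enough loads
       and adds in flight for the out-of-order core to hide the
       L1 latency. */
    float sum4(const float *a, size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; i++) s0 += a[i];   /* tail */
        return (s0 + s1) + (s2 + s3);
    }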