This is a fairly involved discussion. Here is the 10,000-foot view.
Let's assume we have a CPU connected to a memory, and that the CPU can process data 100x faster than that memory can be read or written. If the CPU needs the value of some memory location loaded, it spends roughly 100 cycles waiting for the memory to respond. What can be done to speed up performance?
This is where the memory hierarchy enters. A smaller memory can be interposed between the CPU and the larger memory. While its capacity is much smaller than that of the external memory, it is much quicker. What good does this do?
It wouldn't do an ounce of good unless the cache could repeatedly serve up the same instructions and data over and over again. If you think about your code, this happens all the time: loops use the same instructions and data over and over, function calls reuse the same code in memory, and routines typically work on the same data throughout. So this smaller but quicker memory can be an effective solution because code and data tend to be reused. This principle is known as locality of reference -- instruction and data locality. Without it, caches would be useless.
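To make locality concrete, here's a minimal sketch (with made-up instruction addresses) of the memory access trace a small loop produces. Only a handful of unique addresses account for all the fetches, and that reuse is exactly what a cache exploits:

```python
from collections import Counter

# Hypothetical address trace: a loop whose body is 3 instructions,
# executed for 5 iterations. Addresses are made up for illustration.
loop_body = [0x400, 0x404, 0x408]
trace = loop_body * 5            # 5 iterations -> 15 instruction fetches

counts = Counter(trace)
print(len(trace))                # 15 fetches in total
print(len(counts))               # but only 3 unique addresses

# Every address in the trace is fetched 5 times -- this repetition
# (temporal locality) is why keeping recent fetches in a small, fast
# cache pays off.
print(all(c == 5 for c in counts.values()))
```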
So let's give a high-level example. Assume 1 cycle = 1 CPU operation, 10 cycles = a 'hit' on your cache (a hit means the data or instruction you want is present in the cache), and 100 cycles = an access to main memory for data not contained in the cache.
Let's compare two machines, one with CPU + main memory and another with CPU + cache + main memory.
On the cacheless computer, sequential instruction fetches always cost 100 cycles apiece (this is a simplification btw -- typically CPUs fetch a batch of instructions at one time). Thus 10 sequential memory accesses cost 1000 cycles.
On the computer with a cache, assume the same 10 accesses as in the prior example, and further assume they are iterations through a 3-instruction loop. The first pass through the loop misses, costing 3x100 cycles, and those instructions are loaded into the cache. Assuming they remain there, the remaining 7 fetches hit at a cost of 7x10 cycles. We spend 370 cycles on the system with the cache.
Comparing both cases, performing the same work takes 1000 cycles on the cacheless machine vs. 370 cycles on the machine with a cache, nearly a 3x speed-up.
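The arithmetic above can be written out as a tiny cost model, using the cycle counts from the example (10 cycles per hit, 100 per main-memory access):

```python
HIT_CYCLES = 10     # cost of a cache hit (from the example)
MISS_CYCLES = 100   # cost of going out to main memory
N_ACCESSES = 10

# Cacheless machine: every one of the 10 accesses goes to main memory.
cacheless = N_ACCESSES * MISS_CYCLES
print(cacheless)    # 1000 cycles

# Cached machine: 3 cold misses on the first loop pass, then 7 hits.
misses, hits = 3, 7
cached = misses * MISS_CYCLES + hits * HIT_CYCLES
print(cached)       # 370 cycles

print(f"speed-up: {cacheless / cached:.2f}x")   # speed-up: 2.70x
```

Playing with the hit/miss split shows how sensitive the total is to the miss count -- a few extra misses dominate the bill, since each one costs 10 hits' worth of cycles.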
There is much more to this discussion but that's the general idea of how caching works. Modern computer systems have a hierarchy of caches, usually designated L0, L1, L2, L3, etc. Each increasing number indicates the memory's "distance" from the CPU -- how many intervening levels of memory sit between it and the CPU. An L3, for example, means the system has 3 caches (L0, L1, L2) between it and the CPU. Typically you'll find the L0 and L1 buried inside the CPU itself if you look at the implementation architecture. They are also typically tightly coupled, meaning both must be in operation for the CPU to run, whereas higher-level caches can often be enabled or disabled depending on your needs. Mobile processing systems usually allow this to conserve power.
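The standard way to reason about a hierarchy like this is average memory access time (AMAT): the hit time of a level plus its miss rate times the cost of going to the next level down. A sketch with made-up timings and miss rates (not measurements from any real CPU):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical two-level hierarchy:
#   L2 hits in 10 cycles; 20% of its accesses miss out to main
#   memory at 100 cycles.
l2_amat = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)
print(l2_amat)   # 30.0 cycles on average

#   L1 hits in 1 cycle; 10% of accesses miss and pay the L2 AMAT.
l1_amat = amat(hit_time=1, miss_rate=0.10, miss_penalty=l2_amat)
print(l1_amat)   # 4.0 cycles on average
```

Note how the hierarchy compounds: with these (assumed) rates, the average access costs 4 cycles even though main memory is 100 cycles away.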
Anyways, there's more to your question and I'll try to answer later when I have more time. An awesome book to cut your teeth on: Computer Architecture: A Quantitative Approach by Hennessy & Patterson. It is considered a classic.