What is bad about a cache miss, since an initial cache miss is inevitable?

If a cache miss occurs, the data or instruction that missed gets loaded into the CPU cache. That means the next time the CPU touches it, it will be a cache hit.

So I don't see why people stress how bad a cache miss is, if one is bound to happen on the first access for everything the CPU processes?

Because the cache doesn't have an infinite size? Suppose for simplicity that you have a cache that can hold only one element. What happens if you access elements A and B in this order:

A, B, A, B, A, B, A, B, A, B, A, B

How about:

A, A, A, A, A, A, B, B, B, B, B, B

How many cache misses for each? Which one hits memory the least?
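Just to make that concrete, here's a toy sketch (not how real hardware works, obviously) that counts misses for a hypothetical one-entry cache:

#include <iostream>
#include <vector>

// Toy model: a cache holding exactly one element. An access hits only if it
// matches what is currently cached; otherwise it's a miss and we "fetch" it.
int CountMisses(const std::vector<char>& accesses) {
    int misses = 0;
    char cached = '\0';            // nothing cached yet
    for (char a : accesses) {
        if (a != cached) {         // miss: go to "memory" and replace the entry
            ++misses;
            cached = a;
        }
    }
    return misses;
}

int main() {
    std::cout << CountMisses({'A','B','A','B','A','B','A','B','A','B','A','B'}) << "\n"; // 12 misses
    std::cout << CountMisses({'A','A','A','A','A','A','B','B','B','B','B','B'}) << "\n"; // 2 misses
}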

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

Instructions have locality to them. If you don't have any branches, the CPU will load a series of instructions into the cache, so you'll only miss once per cache line. It's not "one miss per instruction", but rather "one miss per N instructions", where N is the cache line size divided by the instruction size.
In reality, you'll miss even less often than that because modern CPUs prefetch instructions, basically streaming more in as they consume them.
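To put rough numbers on that (assuming a 64-byte cache line and fixed 4-byte instructions; x86 instructions actually vary in size):

constexpr int kCacheLineBytes      = 64;
constexpr int kInstructionBytes    = 4;                                    // assumed fixed-width encoding
constexpr int kInstructionsPerMiss = kCacheLineBytes / kInstructionBytes;  // ~16 straight-line instructions per I-cache miss, before prefetching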

Cache misses are inevitable -- you're speeding up memory accesses by mapping several gigabytes of memory space into just a few megabytes. In other words, cache lines are very limited resources, and every miss that causes a cache line to fill also means evicting whatever data was there before.

Instruction cache / cache misses are less of a big deal since the structure of the code itself just lends itself pretty naturally to being cached. For the most part, all you have to worry about is that your hot loops and any code they call out to will ideally fit inside the L1 instruction cache.

Data cache / cache misses are a much bigger deal, since many data structures are non-linear or accessed in random order (e.g. indexing randomly into an array, or walking a linked list). When you jump all over memory looking at scattered bytes, you evict data that might otherwise still be useful in the cache.

Think of it like this -- all your gigabytes of main memory are divided up into little chunks of 64 bytes, the same size as a cache line. Whenever you read even a single byte from a chunk, the entire chunk is loaded into a CPU cache line.

Now, most CPU caches are 8- or 16-way set-associative, meaning there are 8 or 16 cache lines (a 'set') that can hold any particular chunk copied in from main memory, and chunks that sit a certain stride apart in memory (256KB in the example below) all map to the very same set. This might sound pretty good -- surely you won't be reading from 16 different, conflicting chunks at the same time, right? Well, guess what: you have 8GB of RAM and let's say 4MB of 16-way set-associative cache -- that means roughly 32,000 chunks of main memory are competing for each set of just 16 CPU cache lines. Once you've touched 17 conflicting chunks, every new conflicting chunk you access necessarily boots some other chunk out of the cache. In the worst-case scenario it boots something you'll need again soon, and your access pattern continues in a way that causes it to be evicted again and again, so data ping-pongs back and forth between cache and main memory. You only get the benefit of the cache if you keep accessing more bytes from a cache line while it is still resident.
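A classic way to see this in practice: walk a big 2D array in two different orders. This is just a sketch with assumed sizes (1024x1024 floats, 64-byte lines); only the traversal order differs:

#include <cstddef>
#include <vector>

constexpr std::size_t kDim = 1024;

// Cache-friendly: consecutive elements share a cache line, so all 16 floats of
// each 64-byte line are used before moving on.
float SumRowMajor(const std::vector<float>& grid) {
    float sum = 0.0f;
    for (std::size_t y = 0; y < kDim; ++y)
        for (std::size_t x = 0; x < kDim; ++x)
            sum += grid[y * kDim + x];
    return sum;
}

// Cache-hostile: each access jumps a whole 4KB row ahead, touching one float per
// cache line before that line gets evicted, so nearly every access is a miss.
float SumColumnMajor(const std::vector<float>& grid) {
    float sum = 0.0f;
    for (std::size_t x = 0; x < kDim; ++x)
        for (std::size_t y = 0; y < kDim; ++y)
            sum += grid[y * kDim + x];
    return sum;
}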

And not only does it take a long time to load something from memory, but main memory *bandwidth* is also a limited resource. I did some back-of-the-envelope calculations using DDR3-1866 on a modern x64 platform as an example. If your game runs at 60 frames per second and has absolutely pathological access patterns (that is, one byte is read from a cache line and that line is never touched again), your memory bandwidth and cache lines conspire to limit you to touching less than 8 megabytes of code and data per frame. 8 megabytes is not a lot! With fully-optimal access patterns, you can touch about 500MB of code and data per frame (and read/write it multiple times essentially for free as long as it stays in the cache), which is a world of difference. Even still this is not a lot of data, and is a big part of the reason why some games run different parts of their simulation at different rates (for example, you might run simulation logic at 30fps, or only run 1/4th of your AI entities each frame).
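For what it's worth, the envelope math goes roughly like this (my own assumed figures: dual-channel DDR3-1866 at ~29.9 GB/s theoretical peak, 64-byte cache lines, 60 frames per second):

constexpr double kBytesPerSecond = 1866e6 * 8.0 * 2.0;       // ~29.9 GB/s peak, dual channel
constexpr double kBytesPerFrame  = kBytesPerSecond / 60.0;   // ~498 MB/frame with perfect streaming
constexpr double kUsefulPerFrame = kBytesPerFrame / 64.0;    // ~7.8 MB/frame if only 1 byte of each 64-byte line is ever used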

throw table_exception("(╯°□°）╯︵ ┻━┻");


Even still this is not a lot of data, and is a big part of the reason why some games run different parts of their simulation at different rates (for example, you might run simulation logic at 30fps, or only run 1/4th of your AI entities each frame.)

Not entirely related to the OP's question, but this is one of those magical things I think a lot of people (myself included) don't realize when starting out, or even for many years: you don't need to do everything every frame, or even all on a single frame.

Caching is one of those magical things that take some understanding.

* It is not so much that a single cache miss is bad. As you wrote, they sadly happen quite frequently.

* It is much more that a single cache hit is awesome. And we are used to modern compilers and processors being very awesome.

Due to the complex nature of the modern CPU it is hard to pin down the exact benefit of a cache hit, but usually it makes the operation approximately free. Again: having the data already loaded usually results in free instructions. Since a cache line often holds several values but you only pay the cost of a single load, it's buy one, get the rest free.
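To put some (assumed) numbers on that, with 64-byte cache lines and 4-byte floats:

#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kFloatsPerLine  = kCacheLineBytes / sizeof(float);  // 16
// Scanning a contiguous float array therefore pays roughly one memory load per
// 16 elements; the other 15 ride along in the same cache line for free.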

Good: If your code and data is able to operate entirely in the cache, you can see enormous speed improvements.

Bad: If your code and data doesn't play nicely with cache, you will see performance plummet since you pay full price for everything.

Let's make a little thought experiment: Imagine data carefully and artificially designed so every single memory access was a cache miss, and needed to be fetched from main memory. (Or just imagine the data cache was turned off.)

The exact time required varies based on the chipset, the motherboard, the RAM's timings, and more. You've got those numbers printed on your memory modules, like "9-11-11-28" or whatever; those come into play too. And just for fun, we'll say you've got DDR3-1600 memory, which runs at roughly 5 nanoseconds per cycle (a 200MHz memory clock). So each trip to memory incurs at least these costs:

CAS latency: roughly 12 nanoseconds. It just takes this long to get through.

Time to read and issue the command: from the timings above, 11 cycles * 5ns (the 200MHz memory clock) = 55 nanoseconds minimum.

Time to precharge a row on the memory chip: again 11 cycles, so 11 * 5ns = 55 nanoseconds minimum.

And the final number is the cycles of overhead the chip needs to basically change gears between requests. We'll hit this every time, since in this scenario nothing is cached and we have to wait for each row. So 28 * 5ns = 140 ns minimum.

Then you've got overhead for getting out of the chip, processing along the system bus, and more, but we'll just discount that right now. We'll round down and call that number 250 ns for one trip out to those DDR3 memory chips instead of the cache.
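Adding those up (just restating the numbers above):

constexpr double kCasLatencyNs  = 12.0;    // CAS latency
constexpr double kReadIssueNs   = 55.0;    // 11 cycles * 5 ns
constexpr double kRowChargeNs   = 55.0;    // 11 cycles * 5 ns
constexpr double kRowOverheadNs = 140.0;   // 28 cycles * 5 ns
constexpr double kMissNs = kCasLatencyNs + kReadIssueNs + kRowChargeNs + kRowOverheadNs;  // ~262 ns, rounded to ~250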

So by incurring a cache miss out to main memory, you just cost your process about 250 nanoseconds.

And if the data you miss on has been swapped out to disk, it can potentially take a LONG time to get it back, especially if the disk has spun down. On a slow spinning disk that was sleeping, the cost of that one access might be multiple seconds.

Normally that many nanoseconds is not a particularly enormous amount of time by itself, lots of things cost nanoseconds. Virtual function dispatch, for example, has a cost of roughly 10 nanoseconds. We usually don't lament the cost of a single virtual function because the cost of implementing the same functionality through other means will be at least as expensive. But with virtual functions and other costs, those nanoseconds are spent buying us functionality. With a cache miss, we pay a cost and get nothing in return.

So in our contrived little program every single memory read and memory write incurs a full trip out to main memory.

A simple little addition function, x = a + b + c, gives three loads and a store, or 4 trips to memory, costing 1 microsecond in overhead.

The problem is when it adds up quickly. If *EVERYTHING* is a cache miss out to main memory, then *EVERYTHING* gets a 250 nanosecond penalty.

4,000 trips to main memory mean 1 millisecond lost. Assuming a roughly 10 ms frame budget, that is about 10% of a graphics frame spent purely on overhead.

Roughly 40,000 trips to main memory means you just paid the cost of a full graphics frame while accomplishing no productive work.

So in our thought experiment with a non-caching processor, if you want a fast frame rate you are limited to around 30,000 total memory accesses per frame. It is not because the processor is slow -- the processor can handle the work quite handily -- it is because the cost of trips to main memory dwarfs all other performance concerns.

Now, remember that modern chips don't work like the thought experiment. Importantly, they do have caches turned on -- usually three levels of them. The system transfers blocks larger than a single byte into multiple levels of cache; modern chips typically have L1, L2, and L3 caches. The fastest cache is tiny, and each level further out is slower but larger. So thankfully you need to make trips all the way out to main memory relatively rarely.

Many times when you have a cache miss on the die, the request only has to go out one level of cache and the data is found right there, almost instantly available. That fetch can also serve as a hint to the outer cache levels to prefetch the following blocks of memory. So even then a cache miss doesn't cost 250 ns; it may be only 10, 30, or 80 nanoseconds or so, because the data is sitting in the L2 or L3 cache rather than out in main memory.

One of the biggest differences between Intel's i3/i5/i7 and similar processor lines -- the thing that far and away is the biggest factor in their compute-bound performance -- is the cache size. On the current Haswell generation the cache sizes are 2MB on the lowly Celeron, 3MB on most of the i3s, either 4 or 6MB on the i5s, 8MB on the i7s, and 20MB on the i7 Extreme. On the server lineup (Xeon E3, E5, and E7) they range from 8MB to 45MB of cache. Cache is how the chip gets fed; it doesn't matter how fast the processor is if it is sitting around waiting for data to compute with. A bigger, faster cache is usually the most critical performance enhancer for compute-intensive systems.

The secondary feature that interacts with caching is hyperthreading. Hyperthreading (as explained in the article linked above) basically keeps the same execution core and bolts on a second instruction decoder, letting you run two hardware threads on the same core. That way, when one thread stalls on a cache miss or other hazard, the other can keep running with data that is already in the cache. Of course, if both threads are compute-intensive they end up fighting over cache space. If you are not compute-bound, hyperthreading can give nice speedups; when you are already fighting over cache space in compute-bound threads, it doesn't help matters. That is why the second biggest difference between the various Intel processors is hyperthreading, which is basically just a way to take better advantage of the cache.

Hopefully that will give a bit deeper understanding about CPU caches.

TL;DR: Individual cache misses are inevitable but every cache miss has a cost with no benefit. Better cache use and bigger caches both provide better performance for compute-intensive tasks.

Instruction cache / cache misses are less of a big deal since the structure of the code itself just lends itself pretty naturally to being cached. For the most part, all you have to worry about is that your hot loops and any code they call out to will ideally fit inside the L1 instruction cache.


Unless you're talking about the kind of OOP popularized by Java and bad C++. Consider

#include <memory>
#include <vector>

class Interface {
public:
  virtual ~Interface() = default;
  virtual void DoThing() = 0;
};

// global for expositional purposes only
std::vector<std::unique_ptr<Interface>> g_AllTheThings;

void DoAllTheThings() {
  for (auto const& thing : g_AllTheThings)
    thing->DoThing();  // virtual dispatch; each element may resolve to a different implementation
}
Each invocation of `DoThing` could be referencing a different function than the one before. Hence, you're jumping all over the place in the code. If the functions are even remotely big, each call could easily wipe the other versions of `Interface::DoThing` from the code cache. Then if the next thing uses one of those evicted implementations, we're right back to slowsville.

This is one of the reasons that people still call virtual functions slow: the actual overhead of a virtual function is all but non-existent but the cache misses from jumping around to different implementations (or even just pulling in the vtables) can add up. It really depends on the particular use.

Then there's also the fact that bloating up your objects with an extra pointer for the vtable (or multiple pointers with multiple inheritance) causes cache misses to occur more frequently even for packed arrays of data. This, combined with the instruction cache misses from invoking virtual functions on many different implementations, is one of the many reasons that virtual functions and inheritance are non-ideal ways of dealing with data types, and why high-perf code tends to use disparate packed arrays for each sub-type with at most a functor passed in to operate over the data, e.g.

#include <algorithm>
#include <vector>

struct A { int foo; };

class AProcessor {
public:
  virtual ~AProcessor() = default;
  virtual void Process(A& a) = 0;
};

// global for expositional purposes only
std::vector<A> g_As;

void ForEachA(AProcessor& proc) {
  // capture proc by reference; the same Process implementation runs for every element
  std::for_each(g_As.begin(), g_As.end(), [&](A& a) { proc.Process(a); });
}
Another alternative would be making `AProcessor::Process` take a range of iterators/pointers instead; a rough sketch of that follows below. I prefer the above version (at least as a first option; you can have both) as it's a bit easier to compose operations, and while you are hitting the virtual function call on every iteration of the loop, it's always a call that resolves to the same vtable and the same implementation, so the cache should stay warm.
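A rough sketch of what that range-taking alternative might look like (ABatchProcessor is just a hypothetical name; it reuses A and g_As from above):

class ABatchProcessor {
public:
  virtual ~ABatchProcessor() = default;
  virtual void Process(A* first, A* last) = 0;  // one virtual call per batch instead of one per element
};

void ForEachABatched(ABatchProcessor& proc) {
  if (!g_As.empty())
    proc.Process(g_As.data(), g_As.data() + g_As.size());
}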

Sean Middleditch – Game Systems Engineer – Join my team!

Woe is he who uses virtual dispatch on his hot loops. -- ancient Chinese proverb

Yes, certainly the cult of OOP papers over their ongoing transgressions and its leadership still encourages blind adherence to doctrine that can be harmful. But as Frob pointed out so well, at least you're getting what it is you're paying for. A wiser person would simply avoid virtual dispatch where it wasn't necessary.

Data-Oriented Design techniques, it's worth noting, ought to positively impact cache efficacy for both data (increasing spatial/temporal locality, splitting data structures along usage lines) and code (sorting polymorphic collections such that all foo subclasses are processed at once, then bar, and so on -- see the sketch below).
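For the code side, one sketchy way to do that sort is by dynamic type, reusing the g_AllTheThings example from earlier (assumes <algorithm> and <typeindex> are included):

void SortThingsByType() {
  std::sort(g_AllTheThings.begin(), g_AllTheThings.end(),
            [](const std::unique_ptr<Interface>& a, const std::unique_ptr<Interface>& b) {
              return std::type_index(typeid(*a)) < std::type_index(typeid(*b));
            });
}
// After sorting, the DoAllTheThings() loop calls the same DoThing() implementation
// many times in a row, keeping the I-cache and branch predictor warm.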

There's performance to be gained for sure, but you ought to be fairly well off if you haven't done something painfully naive. I'd pretty much exhaust optimizing D-cache behavior before examining I-cache behavior (though I would design from the start with both in mind).

throw table_exception("(╯°□°）╯︵ ┻━┻");

The cache is a staging area for the data your program reads and writes, arranged in multiple levels of cached contiguous data sitting between main memory and the CPU's registers. Put simply, for a thread working on its own exclusive memory, a write doesn't really get pushed out to the next level up until it has to be, and a read doesn't go out to the next level until the current level can't satisfy it. If a cache miss occurs, modified lines get written back out to the next level and the access continues there (notice the data may not even have made it back to RAM yet); if a cache hit occurs, the lower levels are filled with the nearby memory and the operation continues on into the registers. Once it's done in the registers, the same dance repeats for the memory the next instruction touches, swapping lines up and down the hierarchy until there's a hit.

For example, when I was young I measured my PC's speed with an unintentionally super cache-friendly loop:

float* fs = new float[1000000];
for (int i = 0; i < 1000000; i++)
{
    fs[i] = fs[i] * 3.7f;  // sequential pass: every byte of each cache line gets used
}

you don't wanna know how insanely fast that 4MB of data gets dealt with (read and written)

Most of the important things have been said, but... you also need to ask yourself: what does the CPU do when a cache miss happens? Short answer: most of them will just spin and wait, which results in wasted CPU cycles. The table below gives you an idea of those cycles and the average time required on most modern desktop systems:

[attachment=24330:cpu-cach-info.png]

* Screenshot from an article I'm working on.

Tom Robbins: "There's a tendency today to absolve individuals of moral responsibility and treat them as victims of social circumstance." - Don't be that victim, take responsibility and be what you want to be.

This topic is closed to new replies.
