Cache optimisations for beginners

6 comments, last by Matt_D 14 years, 9 months ago
I posted a question recently asking about ways to improve the performance of some fluid physics I'm working on. I was looking for improvements in the algorithm, but it was suggested to me that I should also look into optimising my code with regards to the CPU cache, and avoiding cache misses. I'm an experienced C++ programmer, but writing stuff that's as performance-critical as complex physics code is somewhat new to me, and it struck me that I don't really know where to begin with making such optimisations, so I thought I'd throw the question out there.

I know the really basic general stuff, such as "try to keep stuff you'll be processing together as close as possible in memory", "store the results of calculations or pointer lookups locally if you'll be using them a few times", and I know the difference between the "good" way to iterate through a 2-dimensional array and the "bad" way, but where else can I go from here?

- Is there a profiling tool available to find out which parts of my code are the worst offenders for cache misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered.)
- What are the most common causes of poor cache behaviour, and how can they be rectified?
- I'm led to believe that caching works best when dealing with structures that are 2^n in size. Is there an upper limit to this size, or any other stuff I should know about that?
- Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject?

I'm not quite sure which are the best questions to ask about this because, as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?
Memory caches are built around two assumptions: spatial locality and temporal locality. The first means that if a program accesses one memory location, it's very likely to access memory locations nearby. The second means that if a program accesses one memory location, it's likely to access that same location again in the near future. The more your program's behavior matches these assumptions, the more effective memory caching will be. To improve spatial locality, you can do things like using sequential memory containers rather than node-based containers (ex: std::vector rather than std::list) and use values rather than pointers where you can. To improve temporal locality, you do things like processing everything you need on A before moving on to B, C ... ZZZ, rather than doing f(A), f(B) ... f(ZZZ), then g(A), g(B) ... g(ZZZ), and so on.
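For instance, a minimal sketch of the spatial-locality point (the Body struct is just an illustration):

#include <list>
#include <vector>

struct Body { float x, y, vx, vy; };

// Cache-friendly: the bodies sit back-to-back in one block of memory, so each
// 64-byte line fetched from RAM brings in several of them at once.
std::vector<Body> bodies;

// Cache-hostile: every node is a separate allocation, so stepping through the
// list tends to jump to a cold line, and chasing the pointers makes it worse.
std::list<Body*> bodyPointers;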

Some profilers, such as Intel VTune and AMD CodeAnalyst, include information about cache misses. I don't recall whether the MSVC 2008 profiler reports that information, and in any case it's only available if you're using one of the Team editions.
Quote:Original post by LemonScented
I'm not quite sure which are the best questions to ask about this because as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?
My understanding is that cache optimisation is largely guesswork combined with rules of thumb.

You can do things like using contiguous sequences of data to maximise spatial locality and minimising the amount of jumping about you do (reduce pointers/references, reduce random access into other arrays/structs, and transform recursive functions into iterative ones). Likewise, if you're iterating over a sequence of data, try to perform as much of the processing as possible in a single pass (rather than repeated iterations) to benefit from temporal locality. SiCrane described both of these already.
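As a concrete sketch of the single-pass idea (f and g stand in for whatever per-element work you do):

#include <cstddef>

void f(float& x);  // hypothetical per-element operations
void g(float& x);

// Two passes: by the time g() reaches data[0], the front of the array has
// probably been evicted from the cache, so everything is fetched twice.
void processTwice(float* data, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) f(data[i]);
    for (std::size_t i = 0; i < n; ++i) g(data[i]);
}

// One fused pass: each element gets both operations while it is still hot.
void processOnce(float* data, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) { f(data[i]); g(data[i]); }
}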

You can also try fiddling with the code-side arrangement of data members to benefit from better packing, and you can separate frequently used data members into their own contiguous blocks, away from their less frequently used contemporaries, for improved spatial locality.
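For instance (field names invented, sizes assume a typical 64-bit compiler):

// Careless ordering: padding is inserted to keep the double and the int
// aligned, so this comes out at 24 bytes.
struct Loose
{
    char   flag;   // 1 byte + 7 bytes padding
    double mass;   // 8 bytes
    short  id;     // 2 bytes + 2 bytes padding
    int    count;  // 4 bytes
};

// Same members, largest first: 16 bytes, so four of these fit in a single
// 64-byte cache line instead of fewer than three.
struct Packed
{
    double mass;
    int    count;
    short  id;
    char   flag;   // 1 byte + 1 byte tail padding
};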
The big thing is really just to be aware of the cache, in light of your data access patterns.

A typical cache line is 64 bytes in length on an x86 CPU. What this means is that whenever you hit main memory, the whole 64-byte line containing that address is pulled into the L1 cache, giving you very fast access to everything in it -- so you want to cram as much of the necessary information as possible into those 64 bytes, which requires both appropriately sized and appropriately aligned data. Be aware, however, that some CPUs (particularly older models or other architectures) have different cache line sizes.
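To make that concrete for something like the OP's fluid grid, a sketch (the FluidCell layout is invented; alignas is C++11, and the equivalent on the OP's VC++ 2008 is __declspec(align(64))):

struct FluidCell
{
    float density;   // 4 bytes
    float pressure;  // 4 bytes
    float vx, vy;    // 8 bytes -> 16 bytes total, exactly four cells per line
};
static_assert(sizeof(FluidCell) == 16, "FluidCell no longer packs four per cache line");

// Start the array on a cache-line boundary so cells never straddle two lines.
alignas(64) static FluidCell grid[256 * 256];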

Recent Intel CPUs (Pentium D, Core, Core 2) have 8-way and 16-way set-associative caches (L1 and L2 respectively), which means the cache is organized into sets of 8 (or 16) lines each. You may be wondering why, and the reason is that many 64-byte lines in main memory map onto each set -- 8-way means the cache can hold, at the same time, up to 8 lines from main memory that map to the same set. The more "ways" a cache has, the less frequently a given line will be evicted -- but the effectiveness of a particular caching setup can't be judged by that number alone; much of it has to do with overall cache size, CPU and memory subsystems, latencies, and typical workloads. Just assume that your CPU manufacturer went through a lot of trouble deciding on the optimal setup.

Then there's cache aliasing. Aliasing describes how often addresses in main memory map to the same set within the cache. Take a typical L1 cache size of 32K (the recent Intel CPUs mentioned above) and divide by the number of "ways", 8 in the case of the L1, to arrive at 4K -- what this means is that for every 64-byte line in main memory which maps to a particular set, every line an integer multiple of 4096 bytes away also maps to that set, and this holds true across the entirety of main memory installed in the system. In practical terms, if you access data aligned on these aliasing bounds more times than your cache has "ways" before you're done using those lines, you'll prematurely evict them. This is sometimes called "thrashing" the cache, though that term is also applied more loosely to general misuse of the cache.
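A classic way to trip over this by accident, sketched below: walking down one column of a 2D float array whose rows are exactly 4096 bytes apart (the sizes are picked to match the 32K, 8-way L1 described above).

// Each row is 1024 floats = 4096 bytes, which is exactly the aliasing
// distance worked out above.
static float field[1024][1024];

float sumColumn(int col)
{
    float sum = 0.0f;
    // Successive accesses are 4096 bytes apart, so they all land in the same
    // 8-entry set; after eight rows the earlier lines are already evicted.
    // When the next column is summed, nothing fetched here is still cached,
    // even though each 64-byte line holds 16 adjacent floats. Walking
    // row-by-row, or padding each row by a few extra floats, avoids this.
    for (int row = 0; row < 1024; ++row)
        sum += field[row][col];
    return sum;
}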

Data structures that are a power of two in size are good -- though that's less a virtue in and of itself, and more because cache line sizes are powers of two; the real issue is that you want your data structures to fit neatly into cache lines. If cache lines were 80 bytes, then good structure sizes would be 2, 4, 5, 8, 10, 16, 20, 40, and 80 bytes. Practically speaking, it would be very esoteric for a CPU to adopt a cache line size that was not a power of two -- I simply use 80 bytes as an illustration of the reasoning behind powers of two being good on real systems.

One more "advanced" technique for designing your data structures is to divide them into "hot" and "cold" data sets. This is beneficial when the structure consists of some data which is frequently accessed in performance critical areas (the hot set), and some which is rarely accessed, or accessed in non-performance-critical areas (the cold set). It works best when the "hot" data, plus an additional pointer to the cold data (or just the hot data if the cold data can be organized in a separate structure) can be made to fit the cache better. For example, say you have a structure of hot and cold data that is 10 32-bit integers, or 40 bytes -- You're only able to fit 1.6 of these structures per cache-line. If we are able to identify 3 "hot" integers, and add a 32-bit pointer to a structure containing the "cold" integers, then our hot structure is only 16 bytes -- we can now fit exactly 4 of these structures into a cache line.

I'd really recommend you read up on the details of how caches work to gain insight on how to structure your code and data access patterns. Some concepts you should try to understand in some depth are: Associativity, Aliasing, Tagging, Line-Size, Exclusive vs. Inclusive, the relationship between cache levels L1<->L2<->L3, and the relationship between the highest-level cache and main-memory.

Finally, be aware that caching setups do vary; for example, AMD Athlon X2 CPUs have 64K, 2-way set-associative L1 caches. The more general rules of thumb mentioned above (spatial and temporal locality of reference) ought to give good results across all platforms -- I wouldn't attempt to optimize aggressively for one particular caching setup unless your platform is predictable (console, handheld, embedded system development), or you are optimizing towards a range of CPUs that have similar caching setups.

throw table_exception("(╯°□°)╯︵ ┻━┻");

Always optimize by choosing better algorithms first. It's only when you KNOW you have the fastest possible algorithms in place that you should begin to worry about this stuff.

There are two stages to this process. The first is done by touch and feel: put your data close together (i.e., arrays for many things), and put the references to the data close together (avoid hopping all over the array where possible). Sort data by frequency of reference instead of by other keys (experimentally: it doesn't always help).

- Rethink memory allocation: try to allocate what you need up front and avoid lots of free/allocate cycles, as that's a cache-killer (a small sketch follows this list).
- Order stack variables in a function from most used to least used, but push large arrays to the bottom.
- Don't recurse when you can iterate.
- Arrange the data members in structs and classes in decreasing size order (doubles and long longs at the top, then ints, then shorts, and so on) to get tight but aligned packing, in the hope of fitting it all in cache.
- The previous poster mentioned segregating hot and cold data. This can be a huge win, but make very sure you know which is which, or you will make things much worse.
- Use unsigned char or short when you need to store small integers; don't use double if float will do.
- Fuss with inlining: you don't want to be constantly hopping from function to function, but if you overdo it you'll end up with functions that don't fit in the cache and hurt yourself.
- Read the assembler listings: optimizers DO make mistakes, and sometimes rearranging code will produce impressive shrinks in the number of instructions needed.
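For the first point, a small sketch of allocating up front with std::vector (the Contact struct and the capacity are invented):

#include <cstddef>
#include <vector>

struct Contact { float x, y, depth; };

void collectContacts(std::vector<Contact>& contacts, std::size_t maxContacts)
{
    // clear() keeps the existing capacity, and reserve() makes at most one
    // allocation up front, so reusing the same vector every step avoids the
    // free/allocate churn (and the reallocation copies) inside the hot loop.
    contacts.clear();
    contacts.reserve(maxContacts);

    for (std::size_t i = 0; i < maxContacts; ++i)
    {
        Contact c = { static_cast<float>(i), 0.0f, 0.0f };  // stand-in for real contact generation
        contacts.push_back(c);
    }
}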

It's all small stuff, but it can add up to a measurable improvement. Measure after every change.

Sometimes you do counterintuitive things. I remember AMD recommending doing tricks with asynch reads to "warm the processor cache ahead of use"; it worked, but it was so ugly I wished it didn't.

If that doesn't get you what you need, the next stage is specialized tools for your environment, and it gets hard. Hope you never have to go there. Sometimes it's just cheaper to move up in hardware, if you have that choice.
Quote:Original post by LemonScented
fluid physics


Since this is one of the problems people have been solving for a very long time, it may make sense to look into an existing library. In general, linear algebra and related topics have been optimized to oblivion already.

Quote:- Is there a profiling tool available to find out which parts of my code are the worst offenders for cache-misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered)


AMD CodeAnalyst and Intel VTune.

Quote:- I'm led to believe that caching works best when dealing with structures that are 2^n in size.


In general, no. The relevant unit is the cache line, typically 64 bytes. Alignment does matter, however, which is why structures are often padded or designed to fit neatly into cache lines.

Quote:- Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject?


I found this overview (PPT) to cover the problem domain well.
If you know your access patterns, then you should prefetch your data before using it.

E.g.

// _mm_prefetch comes from <xmmintrin.h> on x86 compilers; CACHE_LINE_SIZE is 64 here
for (unsigned int i = 0; i < abignumber; ++i) {
    // start pulling in the next cache line while the current element is processed
    _mm_prefetch((const char*)(myarray + i) + CACHE_LINE_SIZE, _MM_HINT_T0);
    process(myarray[i]);
}
Graphics Programmer - Ready At Dawn Studios
1 - before doing anything, profile
2 - after doing anything, profile
3 - only do one thing at a time
4 - only fix things that show up in the profile.

so, you may not actually have a problem at this point - check your profiler output first.
the profile will also show exactly what type of cache problems you have (if you have any).

algorithmic changes will work far better than small instruction-level changes.

data layout and access patterns will be the biggest target area for cache problems.

Matt D
you're never as good as they say you were, never as bad as they say you was.

