SOA and cache locality

Started by
4 comments, last by swiftcoder 7 years, 2 months ago

I've read in several places at this point that to get the most out of your CPU's cache, it's important to pack relevant data together, and access it in a linear fashion. That way the first access to that region of memory loads it into a cache line, and the remaining accesses will be much cheaper. One thing that isn't clear to me, however, is how many cache lines you can have "active" at once.

So for example, if you have a series of 3D vectors, and you lay them out like this:


[xxxx...]
[yyyy...]
[zzzz...]

And then you access your data as:


for (std::size_t i = 0; i < len; ++i) {
    auto x_i = x[i];
    auto y_i = y[i];
    auto z_i = z[i];

    // Do something with x, y, and z
}

Does each array get it's own cache line? Or does accessing the 'y' element push 'x' out of the cache, and then accessing the 'z' element push 'y' out of the cache? If you were to iterate backwards, would that cause more cache misses than iterating forwards?

On another note, while I try to follow best practices for this stuff where possible, I literally have zero idea how effective (or ineffective) it is, since I have no tools for profiling it, and I don't have time write one version of something and then test it against another. Are there any free (or free for students) tools for cache profiling on Windows? I'd love to use Valgrind, but I don't have anything that can run Linux that is also powerful enough to run my game.

Thanks!

Advertisement

The CPU will have a particular cache line size, an L1 cache for each core, an L2 cache (probably shared between cores) and possibly even an L3 cache.

e.g. the Intel i7 6850K has 64-byte sized cache lines, a 64KiB L1 cache per core, 256KiB L2 cache per core, and 15MiB of shared L3 cache.

When you read from a pointer, the lower 6 bits are set to zero, which has the effect of rounding down to the nearest multiple of 64 (64 is 1000000b). The CPU then starts reading 64 bytes starting from that address into the L1 cache (which will in turn try to read the data from the L2 cache, which will try to read it from the L3 cache, which will read it from RAM).

Because there's 64KiB of L1 cache and a cache-line takes up 64 bytes, you can fit 1024 lines into the L1 cache (and 4096 lines into L2, and ~a quarter million lines in L3).

The way that a CPU decides where to put each line in the cache is called associativity.

One simple system is to simply divide up RAM addresses by the cache size and use the remainder as the address. e.g if the cache size was 1000, and you had items in RAM at the addresses a=1337, b=2004, c=1300001, then their local cache addresses would be a=337, b=4, c=1. This can lead to "conflicts" -- e.g. if a=1042 and b=99042, then they both result in the local cache address of 42! If you used a and then b, then b would evict a from the cache (and next time you use a it would evict b from the cache).

There's also many other (smarter) ways that CPU's can translate between local cache addresses and RAM addresses...

Iterating backwards and forwards should have similar performance. The CPU has a predictive prefetcher that will try to guess your memory access patterns (e.g. "iterating linearly forwards") and starts moving data from RAM into cache before you even ask for it. Both "iterating linearly forwards" and "iterating linearly backwards" are common patterns that they will look for.

On less smart CPU's, programmers had to put special "prefetch" instructions into their code to let the CPU know what data was going to be used in the near future. You still sometimes see these used on modern desktop CPU's, but it's almost always unneeded these days.

If you're curious about your own system then you can check your cache sizes with a program called CPU-Z:

http://www.cpuid.com/softwares/cpu-z.html

The address mapping conflict with cache lines that Hodgman talked about is called "cache thrashing" in case you want to do more research on it. It's not something that you need to worry about under normal circumstances, but it's good to be aware of it and understand how to deal with it in case you do encounter a performance issue. Generally locality (keep related data close together in memory) and predictable access patterns are enough to keep things running smoothly. In a contiguous container the size of the elements also begins to matter since it determines how many elements will fit in a cache line.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

In your particular case however your access in that loop is not cache friendly because the arrays are stored after each other if declared and created at the same time. Had you looped over all x's then y's and then z's you would have had a cache friendly access. The trick is to layout your data in such a way that when you need to access that data you access all/most of it in the same loop.

Herb has done a talk about the benefit of looping in order through your data and has some slides that show what happens an why. ( https://channel9.msdn.com/Events/Build/2014/2-661 there is more in this talk than just that.) https://view.officeapps.live.com/op/view.aspx?src=http%3a%2f%2fvideo.ch9.ms%2fsessions%2fbuild%2f2014%2f2-661.pptx

Ps: on most PC hardware the cache line size is 64-bytes.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

The structure of arrays is ONE useful solution, but like every solution it can be misapplied.

The fundamental issue is how the data is accessed. The pattern matches one efficient way to access the data but it is not the only way, and if your access pattern doesn't match the data layout it will backfire.

Does each array get it's own cache line?

A cache line has been 64-bytes for many years now. It will likely eventually become 128 bytes.

Or does accessing the 'y' element push 'x' out of the cache, and then accessing the 'z' element push 'y' out of the cache?

Probably not. Modern processors have large caches, but the inner workings of how they load and evict contents is an implementation detail. There is no guarantee that they work any particular way.

There is a prediction system that can evaluate the current memory accesses and the memory addresses coming through the pipeline (about 30 instructions currently). It can start prefetching your data quite early.

Object sizes still play a role. If your data is not at a 64-byte stride or something easily predictable (96 bytes, 128 bytes, 160 bytes, etc) then it cannot be readily predicted. If it fits on the same cache line the processing is very nearly free. If it doesn't fit on the cache line but the prefetcher is able to predict it the cost is more than it being in L1 but far less than a full cache miss.

It is the cache prediction that gives the big speed advantage in this case. If it cannot predict what is next there will probably be a cache miss.

If you were to iterate backwards, would that cause more cache misses than iterating forwards?

Cache predictor is usually good about those things.

On another note, while I try to follow best practices for this stuff where possible, I literally have zero idea how effective (or ineffective) it is, since I have no tools for profiling it, and I don't have time write one version of something and then test it against another. Are there any free (or free for students) tools for cache profiling on Windows?

AMD's profiler CodeXL can do it. It was made open source, freely available on github. If you haven't used a profiler before you will likely have a bit of a learning curve, but there are many online guides.

Be aware that a structure of arrays really only helps when you are accessing your data sequentially as data streams. Having SoA doesn't provide any benefit if you are jumping around the array, or if you are using individual elements as objects, in which case the array of structures can be better because you're touching all those data members instead. The key detail is that whatever format you have for the data, it should be processed as a regular interval to enable prediction and tightly packed sequentially so multiple data read values span a single cache line.

It's worth keeping in mind that one usually applies cache optimisations very late in the optimisation process, after one has made all the algorithmic improvements one can.

You usually have a very clear picture of your data access patterns by that point (and they aren't going to change very much in the future). Then and only then should you start applying things like SoA, and only where it matches the actual data access pattern.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

This topic is closed to new replies.

Advertisement