# How to cache?

## Recommended Posts

JohnnyCode    1046

Hi.

I have trouble understanding how cache is implemented. I understand all low level aspects of digital computations, yet this is totaly out of my scope. Is there a way to implement cache out of cpu management? Like, if there is a part of memory that is going to get accesed to read from, how would I cache it? The question is not wheather it would benefit me, the question is why it would not benefit me. My more precise problem is GPU cache. There are multiple SIMD (single instruction multiple data) threads running at my side, but they many times access the same memory to read from. This couses huge stalks if the memory is accessed by more SIMD threads. So my understanding is that cache is a memory of every  individual SIMD thread that gets populated with memory that SIMD thread is likely to access. Is it true? My problem is, how to make sure that every SIMD reads from its own spot in memory. Only way I can distinguish between SIMD thread is index of pixel they are computing, this differs fo r every SIMD thread, running or going to run. My only idea of implementing cache for those threads would be:

cache=new [numofthreads*cachesize]<=data likely to read by all paralel threads (say 80, that means 80 identical data often)

{

.........

........

}

Is this how it works?

Thanks a lot for any clarifications!

##### Share on other sites
HappyCoder    5052
When programming applications what you need to know about the cache is that reusing the same piece area of memory will give much better performance than jumping around in the memory. When you request something from memory there is bit of a process that goes on to retrieve that.

So the takeaway of what I was talking about above is to do your calculations in localized area of memory. This means you favor allocating things together in continuous memory if possible instead of allocating separate blocks. Also, when looping over a 2d array, be sure you loop over the rightmost index in the inner loop. like this

int array[4][4];

for (int y = 0; y < 4; ++y)
{
for (int x = 0; x < 4; ++x)
{
array[y][x] = x * y;
}
}
Do not flip x and y. With large data sets, you will generate lots of page faults and will greatly slow down your code.

I don't know much about the other types of caches other than the same rules of locality apply. Also, using the stack when possible in c++ leads to better cache performance, this is because the same region of memory can be used between function calls and so the same parts of memory stay in the cache.

As for how the cache is implemented? How I understand it is for each cache element it stores the address, the value stored at the address, and information about its status and other information needed to function. When an address is requested the address is requested from the cache and every cache line simultaneously checks to see if it stores that address, if it does then it outputs that is has the address as well as the value. All of the outputs of the cache lines are ORed together, but because only one line will match the address, that single value will come back. I don't think that is a very clear explanation and I don't really know much details of how it works, but I hope that somewhat answers your question.

##### Share on other sites
frob    44917
As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.

A lot of experts have spent decades trying to get processors to automatically detect and prefetch your data. The processors do a wonderful job of it when you access your data sequentially.

That said, there are a few hardware-specific prefetch instructions available at a low level. Most programmers will never use them. They are generally used in extreme cases, and only when you have a fixed target of knowing exactly what processor you are using. Not "exact processor" as in, "I'm using an x86 chip", or "I'm using an i7 processor." They are only useful as a final level of optimization on an exact processor like "This will only run on the 4960X" or "This will only run on the ARM694E-S". Even then, they would only be used after nearly every other type of algorithmic and structural optimization has been exhausted. In games this typically means they are limited only to consoles with known chipsets.

##### Share on other sites
King Mir    2490

Hi.

I have trouble understanding how cache is implemented. I understand all low level aspects of digital computations, yet this is totaly out of my scope. Is there a way to implement cache out of cpu management? Like, if there is a part of memory that is going to get accesed to read from, how would I cache it? The question is not wheather it would benefit me, the question is why it would not benefit me.

Caching means storing something when it's accessed, so that next time the same thing is used, you don't have to fetch it again.

The CPU cache caches data that's located in main memory on the CPU itself. That way, accessing the same memory location twice in a row, or near to it, is faster.

The GPU cache is the same, for the GPU.

You can implement a cache in memory, but it wouldn't gain you anything, because accessing the cache would be just as slow as accessing the thing being cached. You could of course implement a of something other than memory, and that might be helpful; for example, your web browser has a cache for visited webpages.

My more precise problem is GPU cache. There are multiple SIMD (single instruction multiple data) threads running at my side, but they many times access the same memory to read from. This couses huge stalks if the memory is accessed by more SIMD threads. So my understanding is that cache is a memory of every  individual SIMD thread that gets populated with memory that SIMD thread is likely to access. Is it true? My problem is, how to make sure that every SIMD reads from its own spot in memory. Only way I can distinguish between SIMD thread is index of pixel they are computing, this differs fo r every SIMD thread, running or going to run. My only idea of implementing cache for those threads would be:

cache=new [numofthreads*cachesize]<=data likely to read by all paralel threads (say 80, that means 80 identical data often)

{
.........
........
}

Is this how it works?
Thanks a lot for any clarifications!

So in the GPU there is actually not 1 cache but 2 caches (though this may vary between GPUs). The L1 cach is per core. This is a cache of a L2 per processor cache. When two cores want to access the same memory address at the same time, and one of them is a write, the GPU has to work to ensure that they both see a consistent view of memory. So it must propagate writes to the L2 cache. And when a core attempts access a memory location it must check that the value hasn't been changed on the L2 cache. If it has, then it has to get the new value from the L2 cache. If it hasn't then it can get the value from the L1 cache. That's why multiple cores writing to the same memory location is slower than when they don't share any data.

The CPU works the same way, except there are generally 3 levels of cache.

Note however that this only applies when at least one thread is writing to memory. Having multiple cores read from the same location is not a problem.

So the best way to ensure that the L1 cache is used efficiently is not to share mutable data between multiple cores. I presume you could either partition the data smartly between cores, or buffer the data so that a texture is not being written to and read from at the same time.

##### Share on other sites
King Mir    2490

This is all true, but is not related to the CPU cache at all, except that both do caching of a sort. Pages and page faults are a system that runs as software on the operating system, though the TLB does seem to be a cpu assisted part. The CPU cache is implemented entirely in hardware on the CPU (and the same may exist for the GPU).

Operating system paging can be viewed as a fully-associative cache of the page file, but the more common way of looking at is that the page file (or swap partition) is a backup to main memory.

So the takeaway of what I was talking about above is to do your calculations in localized area of memory. This means you favor allocating things together in continuous memory if possible instead of allocating separate blocks. Also, when looping over a 2d array, be sure you loop over the rightmost index in the inner loop. like this

int array[4][4];

for (int y = 0; y < 4; ++y)
{
for (int x = 0; x < 4; ++x)
{
array[y][x] = x * y;
}
}
Do not flip x and y. With large data sets, you will generate lots of page faults and will greatly slow down your code.

Despite the inaccuracy of the above, this is good advice. The advantage of accessing data near to prior accesses applies to the CPU cache even more than it did to page files. (Paging is not really relevant to modern programming, unless you're implementing a database, or should be using one).

As for how the cache is implemented? How I understand it is for each cache element it stores the address, the value stored at the address, and information about its status and other information needed to function. When an address is requested the address is requested from the cache and every cache line simultaneously checks to see if it stores that address, if it does then it outputs that is has the address as well as the value. All of the outputs of the cache lines are ORed together, but because only one line will match the address, that single value will come back. I don't think that is a very clear explanation and I don't really know much details of how it works, but I hope that somewhat answers your question.

That's not how it works, because the CPU caches are not fully associative. They may be non-associate or set associative. You can look those terms up for more info.

I don't know the associativity of GPUs. Edited by King Mir

##### Share on other sites
Hiwas    5807
No offense to anyone but GPU cache is something "you" don't optimize for in terms of shader code. Some interesting information has been posted but unless you are working on a specific console (fixed hardware) you simply don't have any way to really optimize your cache usage in shaders properly because each GPU is very different. But more importantly, the big cache items are way beyond your control and you simply can not "optimize" them. The big items in order of cost: texture lookup, stream data lookup (vertex/indice buffers) then the very small number of variables you use in your shaders and the shader code itself. (NOTE: Texture also means render targets including the current backbuffer.)

Textures are swizzled in memory for (usually) bilinear lookup, so the swizzles are typically at least a 2x2. Based on fetch sizes though, some use 4x4 and some use 8x8. So, trying to optimize your code to deal with 2x2 would degrade performance on a 4x4 swizzle card. So, in general, don't try to optimize this.

Geometry data is the next biggest hit, but it is an order of magnitude smaller than the textures. You "can" get good benefits by simply running a model through a generic 128 byte cache line optimizer. On smaller cache line GPU's, it still optimizes reasonably compared to raw unsorted data and on larger cacheline machines it reduces a significant number of hits. But.... This is actually pretty trivial in comparison to texture costs.

Your shaders. Other than the linear work they do per pixel, there is nearly no way to optimize them for cache coherency in such an environment. Of course most communication is through "registers" in the shader code (i.e. the limit on the number of items you can send to a shader) and as such, those are not cached at all, they are "in GPU" items and there simply is no cache to optimize for. So, other than reducing instruction counts and using best operations to reduce the GPU concept of cycle counts, you really can't "optimize" beyond supplying the smallest and most cycle efficient code.

Give me a CPU, I'll take what Happy posted and blow your mind. With a GPU, you are exceptionally limited because the GPU's vary so much, even when made by the same company, that your optimizations will likely die horribly on other GPU's.

##### Share on other sites
DareDeveloper    1018

Not sure if this is helpful at all, but Java's ThreadLocal came to mind when I read the SIMD stuff in the original post.

It kinda abstracts away the part where you link the thread id to the data for one thread. You have one ThreadLocal instance that all threads use, but get() returns an object for the current thread (making sure that threads don't share objects). Not sure if the same thing can be done with C++ templates or with whatever language you use ... but here is basically what the ThreadLocal needs:

T get()

Returns the value in the current thread's copy of this thread-local variable.

protected T initialValue()

void remove()

void set(T value)

Sets the current thread's copy of this thread-local variable to the specified value.

How does it work?

There is a map where the thread ids are linked to the value. The map is managed in the get() and remove() functions.

If there is no value when get() is called it uses initialValue() to create a new object and to store it in the map.

Not sure if this is what you are asking for or if that is even a valid solution when it comes to performance ... I'll just dump those thoughts here.

Guess it might not even work in some languages because you want to be able to extends ThreadLocal and overwrite initialValue() ... but that means mixing static and dynamic polymorphism?

Edited by DareDeveloper

##### Share on other sites
Satharis    2444
Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about. In terms of game development the farthest engines tend to go with it is things like changing how memory is allocated in order to create contiguous blocks. Of course you can stick to more specific optimization techniques, there's quite a few related to the GPU I'm not very experienced enough in to comment on.

Essentially it works the same way as in HappyCoder's array example in that the objects are allocated as arrays and are thus in contiguous blocks of memory. A good way to think of it is similar to how a hard drive would read data, that you want to prevent the read head from having to "jump" all over to different memory blocks in order to just read a small section.

Dealing with it on the GPU is a bit of a different story since it basically relates to how you allocate and send data to the video card and the relationship between it and the CPU.

In short it's not really a broad subject of one thing to do, more a bunch of smaller techniques for getting a bit more speed out of sections of code, critical ones obviously benefiting more due to repeat behavior experiencing exponential reduction in time if you optimize a single go around. Edited by Satharis

##### Share on other sites
JohnnyCode    1046

Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about.

So, at the level of OS and compiler , nothing is done towards this maintenance, except "as close memory acceces at a time as possible"?

And, is accessing memory to read from by multiple threads all performance safe? And if so why? Is there a point in making every thread have its own memory pool populated with even duplicated virtual RAM or not?

##### Share on other sites
Satharis    2444

So, at the level of OS and compiler, nothing is done towards this maintenance, except "as close memory acceces at a time as possible"?

It depends on the compiler of course but most of the repetoire of what a compiler does for optimization is a lot of little techniques for either reducing instruction count and redundancies or to help cache misses by aligning memory or instructions. There's a lot of techniques really that all come down to how the code is translated so it really isn't something the programmer themself deals with.

And, is accessing memory to read from by multiple threads all performance safe? And if so why? Is there a point in making every thread have its own memory pool populated with even duplicated virtual RAM or not?

It's often performance debatable at best. Accessing memory from two different cores usually is not performance helpful because of two things:

Processor design: Processors vary a LOT in how they use cache, a lot use seperate L1 caches for each core and may use shared cache at L2 or L3, but there's nothing saying that HAS to be true. It's certainly possible to not use shared cache, it just often is done in order to save chip space. in the case of each processor having it's own cache that means even if they all accessed the same data they would -all- make a copy into their cache, thus being very redundant.

Shared cache itself: Shared cache can vary in its performance based on the type of operations being performed. Shared cache is often L2 or L3 cache so it is explicitly slower than the L1, so in most cases the core will get the most bang for the buck out of utilizing the non-shared cache anyway. Utilizing the shared cache can be slower than reaching out to RAM depending on how many cores are accessing a particular cache line and what operations they're doing. In general it's better to have each core working on a different piece of data or instruction in the shared cache anyway.

In terms of overall performance the biggest real way to increase performance is to align data in memory and then work in it in cache line sized segments. In that case you could theoretically have performance gains if you say, took a loop over a set of data and split into 4 smaller loops. But actually coding for something like this would be a bit of a waste of time. It may be a compiler optimization already, can vary based on the CPU and also would be meaningless in most cases unless it was done in extremely performance critical sections, to where the compiler is likely to make a lot of optimizations anyway. Edited by Satharis

##### Share on other sites
frob    44917

And, is accessing memory to read from by multiple threads all performance safe?
Thread safety for memory is accomplished through memory barriers.

It is one of many issues involved in threaded programming. You must establish a memory barrier before modifying any shared variable. This can be accomplished with interlocked functions which have great performance, or through more complex locks which require additional CPU time.

##### Share on other sites
King Mir    2490

Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about. In terms of game development the farthest engines tend to go with it is things like changing how memory is allocated in order to create contiguous blocks.

I disagree that people writing performance critical code should not think about cache behavior. Quite the opposite, it can be the greatest bottleneck in a memory rich application. At the least one should be aware of you should avoid accessing disparate memory, and that cache lines are 64 bytes.

And, is accessing memory to read from by multiple threads all performance safe? And if so why? Is there a point in making every thread have its own memory pool populated with even duplicated virtual RAM or not?

Accessing memory to read by multiple threads is safe and fast. In rare cases it can even lead to superlinear speedup, because one thread can pull a piece of data from memory to the L3 cache, that data can then be used by another thread. As long as no threads are writing to that memory. Reading the same piece of memory is safe because there is no synchronization; the data cannot change.

And, is accessing memory to read from by multiple threads all performance safe?

Thread safety for memory is accomplished through memory barriers.

It is one of many issues involved in threaded programming. You must establish a memory barrier before modifying any shared variable. This can be accomplished with interlocked functions which have great performance, or through more complex locks which require additional CPU time.

As per above, this is not true for constant data. If no thread is writing to shared data, then no synchronization is needed on that data. Edited by King Mir

##### Share on other sites
LorenzoGatti    4442

Apart from sequential memory accesses, there's a complementary way to take advantage of cache: doing as much computation as possible with the same data before loading other data.

For example, instead of looping (in the best possible sequential fashion) through an array of guided missiles to update their position, then again to compute collisions, then again to compact it after some missiles have exploded, then again to compute their acceleration from the guide system, then again to compute their new speed,  looping through them only twice is going to reduce loads and stores, and the ensuing cache misses, by 60%.

##### Share on other sites
King Mir    2490
If were giving out general cache performance tips, here's a good one from Andrei Alexandrescu given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is withing the first 64 bytes of the class.

##### Share on other sites
fir    460

I had read somewhere that current systems had not one but many hardware prefetchers and you could (if I remember correctly) even experiment in disabling enabling them (probably in bios) to check it influence of performance - Do someone heremaybe  know something about it, could say something? (maybe to check if he has something like that iin bios? I got an old machine)

##### Share on other sites
Ravyne    14300

On the GPU its hard to really take any concrete action on the very low-level. For the most part, you can't rely on any set of GPUs behaving in the same way WRT caching because each vendor has their own secret sauce. Newer GPUs are becoming more homogenized across generations and even across vendors as GPU compute becomes more and more a factor in their hardware designs, but there's still quite a bit of variance today. I predict they will continue to trend closer and closer, but I doubt we'll ever see the kind of homogeneity that exists between, say, AMD and Intel processors.

What you can do instead is be less worried about specific caching behaviors, and more worried about the capabilities defined by different Direct3D versions -- D3D specifies the number of instruction slots, registers, and shared memory available, among other things. You also need to be aware of how GPU execution differs (e.g. branches execute both sides if even one element goes in each direction, results are masked and merged -- and this grows exponentially). Optimal GPU code loads in some memory that no other thread-group wants to touch at the same time, parties on it for a bit, writes back the result, and moves onto the next bit of memory.

##### Share on other sites
fir    460

As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.

A couple of days before there was topic on that, i was asking mainly for two examples of prefetching that could

be expressed in such simple example

for(int i = 0; i < N; i++)

A[i*4] = B[i*3];

1) many interlaced acces arrays to prefetch - will this work ?

2) prefetching by stride      - will this work ?

I found some hints over the net which say

1) seems so that many interlaced streams can be prefetched (I do not know if there is some limit I suspect there is some)

2) prefetchers can do by stride

- (probably they do it in stride way that is they are able to get some bytes, skip gap, get some bytes, skip gap, not just all contigious memory (that stride acces was in the suggestions I was reading then about it - but if cache line to get is 64 bytes it would seem that will probably get at least 64 bytes  - i dont know)

[if so thinking about examples in those previous topic

can bring such outcome - when scanning full table of used and unused record prefetcher will prefetch it all -

this is more cache streem (used and unused records)

but no cache misses - in the second example when

reaching only for used records - it would be much slight

cache stream but I think prefetcher could not work

here - if so I could suspect that first example can be

better]

- they can not work when stride is big about 1000 or something , and in such case there would be a stall

thats what I found - it is at all important knowledge to consider imo if someone is interested in optimization

Edited by fir

##### Share on other sites
fir    460

If were giving out general cache performance tips, here's a good one from Andrei Alexandrescu given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is withing the first 64 bytes of the class.

I also wonder if they should be not aligned to 64 bytes sometimes.. Is this 64B cache line loaded aligned one or just 64B long unaligned one?

##### Share on other sites
King Mir    2490

If were giving out general cache performance tips, here's a good one from Andrei Alexandrescu given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is withing the first 64 bytes of the class.

I also wonder if they should be not aligned to 64 bytes sometimes.. Is this 64B cache line loaded aligned one or just 64B long unaligned one?

Yes it does need to be 64 Byte aligned. So you want padding after large classes to make them divisible by 64 bytes. Edited by King Mir

##### Share on other sites
fir    460

If were giving out general cache performance tips, here's a good one from Andrei Alexandrescu given at Going Native on Wednesday: organize your class data members by hotness, so that the hottest data is withing the first 64 bytes of the class.

I also wonder if they should be not aligned to 64 bytes sometimes.. Is this 64B cache line loaded aligned one or just 64B long unaligned one?

Yes it does need to be 64 Byte aligned. So you want padding after large classes to make them divisible by 64 bytes.

Does someone maybe know? - if cpu reads some byte (and it goes through cache) it always get such aligned 64B line (it is some bytes forward and some bytes backward usually?) ? That is when I read some byte may I assume that whole such aligned line (back and forth) is in cache?

##### Share on other sites
Ravyne    14300

Caching on the CPU generally works like this: The entire memory space is chopped up into 64-byte chunks conceptually -- whenever you read any byte within that chunk, the entire chunk is loaded up. So if you read the first byte of a chunk, all the bytes after it are loaded too. If you read the last byte, all the bytes before it are read. If you read from somewhere in the middle, all the bytes before and after are read. This is typical of L2 and L3 caches; I think L1 cache sometimes (still? used to?) operate with smaller cache-lines that were 16 or 32 bytes. Each cache 'line' corresponds to several chunks (total memory / total cache), so it can be the case that reading from some other memory location will sometimes evict a line from the cache prematurely; a direct-mapped cache would always evict because there's only one cache line to go around for all the memory locations it corresponds to, but an n-way set-associative has has n 'slots' (an abstraction of a cache line) so that nothing will be prematurely ejected as long as you aren't actively reading from n+1 or more memory locations simultaneously (most desktop systems today are 4-way set associative, which is sufficient for most access patterns.

On a GPU, caching for texture access is prioritized for spacial-locality (in texture coordinates) rather than linear access patterns. I'm honestly not sure what the state of automated caching is for GPU compute, but there's also a small block of memory for each cluster of SIMDs in a GPU that's under direct programmer control, which fulfills the role of cache too.