How to cache?

19 comments, last by Ravyne 10 years, 7 months ago

Hi.

I have trouble understanding how a cache is implemented. I understand all the low-level aspects of digital computation, yet this is totally out of my scope. Is there a way to implement a cache outside of CPU management? Like, if there is a part of memory that is going to be read from, how would I cache it? The question is not whether it would benefit me; the question is why it would not benefit me.

My more precise problem is the GPU cache. There are multiple SIMD (single instruction, multiple data) threads running on my side, but they often access the same memory to read from. This causes huge stalls if the memory is accessed by more SIMD threads. So my understanding is that the cache is a memory for every individual SIMD thread that gets populated with the memory that thread is likely to access. Is that true? My problem is how to make sure that every SIMD thread reads from its own spot in memory. The only way I can distinguish between SIMD threads is the index of the pixel they are computing, which differs for every SIMD thread, running or about to run. My only idea for implementing a cache for those threads would be:

cache = new [numofthreads * cachesize]   // data likely to be read by all parallel threads (say 80 threads, so often 80 identical copies)

thread
{
    .........
    var some = cache[threadID * cachesize + particulardata];
    ........
}

Is this how it works?

Thanks a lot for any clarifications!

When programming applications, what you need to know about the cache is that reusing the same area of memory will give much better performance than jumping around in memory. When you request something from memory, there is a bit of a process that goes on to retrieve it.

Let's say you started up the application and requested memory at 0x2000, and this memory has not been requested yet. First of all, the CPU doesn't actually access physical memory at 0x2000; 0x2000 is a virtual address that has to be translated into a physical address. I won't go into a lot of detail, but I will try to briefly explain how the translation happens. The first time you request the memory, the CPU will see that a page of memory has not been assigned to that address range, so the operating system finds a free page of physical memory, maps it to the address region you requested, and the program continues on. This process is known as a page fault. Every time memory is accessed by its address, that address is actually the virtual address. The virtual address needs to be translated to a physical address before the request can go out to RAM. There are caches specifically used to store information about how to translate virtual addresses; that cache is called the translation lookaside buffer (TLB). Every time the program has a page fault, it actually sets aside the entire page the requested memory lies in. The TLB cannot store all of the virtual memory translation entries, so only the most recently used ones are cached.
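To make the translation step concrete, here is a minimal sketch (assuming a hypothetical 4 KiB page size, which is common but not universal) of how a virtual address splits into a page number and an offset. The TLB caches recent page-number-to-frame mappings, so most accesses skip the full page-table walk.

#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t kPageSize = 4096;       // assumed 4 KiB pages
    uint64_t virtualAddress  = 0x2000;     // the address the program uses

    uint64_t pageNumber = virtualAddress / kPageSize; // looked up via the TLB / page tables
    uint64_t offset     = virtualAddress % kPageSize; // carried over unchanged into the physical address

    printf("page %llu, offset %llu\n",
           (unsigned long long)pageNumber, (unsigned long long)offset);
    return 0;
}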

So the takeaway of what I was talking about above is to do your calculations in a localized area of memory. This means you should favor allocating things together in contiguous memory when possible instead of allocating separate blocks. Also, when looping over a 2D array, be sure you loop over the rightmost index in the inner loop, like this:


int array[4][4];

// The rightmost index (x) varies in the inner loop, so consecutive
// writes touch consecutive memory locations.
for (int y = 0; y < 4; ++y)
{
  for (int x = 0; x < 4; ++x)
  {
     array[y][x] = x * y;
  }
}
Do not flip x and y. With large data sets, you will generate lots of page faults and will greatly slow down your code.
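For contrast, here is the order to avoid, written out for the same 4x4 example. With the indices flipped, the inner loop strides down a column, so consecutive writes are a whole row apart in memory; on a 4x4 array it hardly matters, but on a large array each write can land on a different cache line (or page).

int array[4][4];

// Cache-unfriendly: the inner loop walks a column, jumping a full
// row of memory between consecutive writes.
for (int x = 0; x < 4; ++x)
{
  for (int y = 0; y < 4; ++y)
  {
     array[y][x] = x * y;
  }
}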

I don't know much about the other types of caches, other than that the same rules of locality apply. Also, using the stack when possible in C++ leads to better cache performance, because the same region of memory is reused between function calls and so those parts of memory stay in the cache.

As for how the cache is implemented? How I understand it is that each cache element stores the address, the value stored at that address, its status, and other bookkeeping needed to function. When an address is requested, it is presented to the cache and every cache line simultaneously checks whether it stores that address; if it does, then it outputs that it has the address as well as the value. All of the outputs of the cache lines are ORed together, but because only one line will match the address, that single value comes back. I don't think that is a very clear explanation and I don't really know the details of how it works, but I hope that somewhat answers your question.
My current game project Platform RPG
As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.

A lot of experts have spent decades trying to get processors to automatically detect and prefetch your data. The processors do a wonderful job of it when you access your data sequentially.

That said, there are a few hardware-specific prefetch instructions available at a low level. Most programmers will never use them. They are generally used in extreme cases, and only when you have a fixed target and know exactly what processor you are using. Not "exact processor" as in "I'm using an x86 chip" or "I'm using an i7 processor"; they are only useful as a final level of optimization on an exact processor, as in "this will only run on the 4960X" or "this will only run on the ARM946E-S". Even then, they would only be used after nearly every other type of algorithmic and structural optimization has been exhausted. In games this typically means they are limited to consoles with known chipsets.
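As a rough illustration only (not a recommendation), x86 compilers expose one of these instructions through the _mm_prefetch intrinsic. The sketch below asks the hardware to start pulling a later element toward the cache while the current one is being summed; the prefetch distance of 16 is a made-up number that would need tuning for an exact processor, and on most modern chips the automatic prefetcher already handles this pattern fine.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

float sum(const float* data, int count)
{
    float total = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        // Hint that the element ~16 slots ahead will be needed soon.
        if (i + 16 < count)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + 16), _MM_HINT_T0);
        total += data[i];
    }
    return total;
}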

Hi.

I have trouble understanding how a cache is implemented. I understand all the low-level aspects of digital computation, yet this is totally out of my scope. Is there a way to implement a cache outside of CPU management? Like, if there is a part of memory that is going to be read from, how would I cache it? The question is not whether it would benefit me; the question is why it would not benefit me.

Caching means storing something when it's accessed, so that next time the same thing is used, you don't have to fetch it again.

The CPU cache caches data that's located in main memory on the CPU itself. That way, accessing the same memory location twice in a row, or near to it, is faster.

The GPU cache is the same, for the GPU.

You can implement a cache in memory, but it wouldn't gain you anything, because accessing the cache would be just as slow as accessing the thing being cached. You could of course implement a cache of something other than memory, and that might be helpful; for example, your web browser has a cache for visited web pages.

My more precise problem is the GPU cache. There are multiple SIMD (single instruction, multiple data) threads running on my side, but they often access the same memory to read from. This causes huge stalls if the memory is accessed by more SIMD threads. So my understanding is that the cache is a memory for every individual SIMD thread that gets populated with the memory that thread is likely to access. Is that true? My problem is how to make sure that every SIMD thread reads from its own spot in memory. The only way I can distinguish between SIMD threads is the index of the pixel they are computing, which differs for every SIMD thread, running or about to run. My only idea for implementing a cache for those threads would be:

cache = new [numofthreads * cachesize]   // data likely to be read by all parallel threads (say 80 threads, so often 80 identical copies)

thread
{
    .........
    var some = cache[threadID * cachesize + particulardata];
    ........
}

Is this how it works?
Thanks a lot for any clarifications!

So in the GPU there is actually not one cache but two (though this may vary between GPUs). The L1 cache is per core, and it is itself backed by an L2 cache shared by the whole processor. When two cores want to access the same memory address at the same time, and one of them is a write, the GPU has to work to ensure that they both see a consistent view of memory. So it must propagate writes to the L2 cache. And when a core attempts to access a memory location, it must check that the value hasn't been changed in the L2 cache. If it has, then it has to get the new value from the L2 cache; if it hasn't, then it can get the value from its L1 cache. That's why multiple cores writing to the same memory location is slower than when they don't share any data.

The CPU works the same way, except there are generally 3 levels of cache.

Note however that this only applies when at least one thread is writing to memory. Having multiple cores read from the same location is not a problem.

So the best way to ensure that the L1 cache is used efficiently is not to share mutable data between multiple cores. I presume you could either partition the data smartly between cores, or buffer the data so that a texture is not being written to and read from at the same time.
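Here is a minimal CPU-side sketch of the "partition the data between cores" idea (the names are made up for illustration). Each thread writes only to its own contiguous slice of the output, so no cache line is ever written by two cores at once and the L1 caches don't have to ping-pong data back and forth.

#include <cstddef>
#include <thread>
#include <vector>

// Each worker writes only to its own [begin, end) slice of 'out',
// so threads never write to the same region of memory.
void worker(std::vector<float>& out, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i)
        out[i] = static_cast<float>(i) * 0.5f;
}

int main()
{
    const size_t n = 1 << 20;
    const size_t numThreads = 4;
    std::vector<float> out(n);

    std::vector<std::thread> threads;
    const size_t chunk = n / numThreads;
    for (size_t t = 0; t < numThreads; ++t)
    {
        size_t begin = t * chunk;
        size_t end   = (t == numThreads - 1) ? n : begin + chunk;
        threads.emplace_back(worker, std::ref(out), begin, end);
    }
    for (std::thread& th : threads)
        th.join();
}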

Let's say you started up the application and requested memory at 0x2000, and this memory has not been requested yet. First of all, the CPU doesn't actually access physical memory at 0x2000; 0x2000 is a virtual address that has to be translated into a physical address. I won't go into a lot of detail, but I will try to briefly explain how the translation happens. The first time you request the memory, the CPU will see that a page of memory has not been assigned to that address range, so the operating system finds a free page of physical memory, maps it to the address region you requested, and the program continues on. This process is known as a page fault. Every time memory is accessed by its address, that address is actually the virtual address. The virtual address needs to be translated to a physical address before the request can go out to RAM. There are caches specifically used to store information about how to translate virtual addresses; that cache is called the translation lookaside buffer (TLB). Every time the program has a page fault, it actually sets aside the entire page the requested memory lies in. The TLB cannot store all of the virtual memory translation entries, so only the most recently used ones are cached.

This is all true, but it is not related to the CPU cache at all, except that both do caching of a sort. Pages and page faults are a mechanism managed in software by the operating system, though the TLB is a CPU-assisted part of it. The CPU cache is implemented entirely in hardware on the CPU (and the same goes for the GPU).

Operating system paging can be viewed as a fully associative cache of the page file, but the more common way of looking at it is that the page file (or swap partition) is a backup to main memory.

So the takeaway of what I was talking about above is to do your calculations in a localized area of memory. This means you should favor allocating things together in contiguous memory when possible instead of allocating separate blocks. Also, when looping over a 2D array, be sure you loop over the rightmost index in the inner loop, like this:

int array[4][4];

for (int y = 0; y < 4; ++y)
{
  for (int x = 0; x < 4; ++x)
  {
     array[y][x] = x * y;
  }
}
Do not flip x and y. With large data sets, you will generate lots of page faults and will greatly slow down your code.

Despite the inaccuracy of the above, this is good advice. The advantage of accessing data near prior accesses applies to the CPU cache even more than it does to paging. (Paging is not really relevant to modern programming, unless you're implementing a database, or should be using one.)

As for how the cache is implemented? How I understand it is that each cache element stores the address, the value stored at that address, its status, and other bookkeeping needed to function. When an address is requested, it is presented to the cache and every cache line simultaneously checks whether it stores that address; if it does, then it outputs that it has the address as well as the value. All of the outputs of the cache lines are ORed together, but because only one line will match the address, that single value comes back. I don't think that is a very clear explanation and I don't really know the details of how it works, but I hope that somewhat answers your question.

That's not how it works, because CPU caches are not fully associative. They are typically direct-mapped or set-associative. You can look those terms up for more info.

I don't know the associativity of GPUs.
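For a back-of-the-envelope picture of set associativity (the sizes below are typical assumptions, not taken from any specific chip): in a 32 KiB, 8-way cache with 64-byte lines there are 64 sets, so for any given address the hardware only has to compare tags against the 8 lines of one set rather than against every line in the cache.

#include <cstdint>
#include <cstdio>

int main()
{
    // Hypothetical 32 KiB, 8-way set-associative cache.
    const uint64_t kLineSize = 64;   // bytes per cache line
    const uint64_t kNumSets  = 64;   // 32768 bytes / 64 bytes per line / 8 ways

    uint64_t address    = 0x2040;
    uint64_t lineNumber = address / kLineSize;   // which 64-byte block of memory
    uint64_t setIndex   = lineNumber % kNumSets; // the only set this block may occupy
    uint64_t tag        = lineNumber / kNumSets; // stored per line and compared on lookup

    printf("set %llu, tag %llu\n",
           (unsigned long long)setIndex, (unsigned long long)tag);
    return 0;
}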
No offense to anyone, but GPU cache is something "you" don't optimize for in terms of shader code. Some interesting information has been posted, but unless you are working on a specific console (fixed hardware) you simply don't have any way to really optimize your cache usage in shaders properly, because each GPU is very different. More importantly, the big cache items are way beyond your control and you simply cannot "optimize" them. The big items, in order of cost: texture lookups, stream data lookups (vertex/index buffers), then the very small number of variables you use in your shaders and the shader code itself. (NOTE: "texture" also means render targets, including the current backbuffer.)

Textures are swizzled in memory for (usually) bilinear lookup, so the swizzles are typically at least 2x2. Based on fetch sizes, though, some GPUs use 4x4 and some use 8x8. So trying to optimize your code for a 2x2 swizzle would degrade performance on a 4x4-swizzle card. In general, don't try to optimize this.
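For what it's worth, here is a minimal sketch of one common swizzle pattern, 2D Morton (Z-order) interleaving, which keeps texels that are close in (x, y) close together in memory. Real GPUs use their own, often undocumented, tilings, so this only shows the general idea, not any particular card's layout.

#include <cstdint>

// Interleave the low 16 bits of x and y into a Z-order (Morton) index,
// so each 2x2 (and 4x4, 8x8, ...) block of texels occupies a contiguous
// run of addresses instead of being spread across distant rows.
uint32_t mortonIndex(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (uint32_t bit = 0; bit < 16; ++bit)
    {
        result |= ((x >> bit) & 1u) << (2 * bit);
        result |= ((y >> bit) & 1u) << (2 * bit + 1);
    }
    return result;
}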

Geometry data is the next biggest hit, but it is an order of magnitude smaller than the textures. You "can" get good benefits by simply running a model through a generic 128-byte-cache-line optimizer. On GPUs with smaller cache lines it still optimizes reasonably compared to raw unsorted data, and on larger-cache-line machines it avoids a significant number of misses. But... this is actually pretty trivial in comparison to texture costs.

Your shaders: other than the linear work they do per pixel, there is nearly no way to optimize them for cache coherency in such an environment. Most communication is through "registers" in the shader code (i.e. the limit on the number of items you can send to a shader), and as such those are not cached at all; they are "in GPU" items and there simply is no cache to optimize for. So, other than reducing instruction counts and using the best operations to reduce the GPU's notion of cycle counts, you really can't "optimize" beyond supplying the smallest and most cycle-efficient code.

Give me a CPU and I'll take what Happy posted and blow your mind. With a GPU you are exceptionally limited, because GPUs vary so much, even when made by the same company, that your optimizations will likely die horribly on other GPUs.

Not sure if this is helpful at all, but Java's ThreadLocal came to mind when I read the SIMD stuff in the original post.

It kinda abstracts away the part where you link the thread id to the data for one thread. You have one ThreadLocal instance that all threads use, but get() returns an object for the current thread (making sure that threads don't share objects). Not sure if the same thing can be done with C++ templates or with whatever language you use ... but here is basically what the ThreadLocal needs:

T get()

Returns the value in the current thread's copy of this thread-local variable.

protected T initialValue()

Returns the current thread's "initial value" for this thread-local variable.

void remove()

Removes the current thread's value for this thread-local variable.

void set(T value)

Sets the current thread's copy of this thread-local variable to the specified value.

How does it work?

There is a map where the thread ids are linked to the values. The map is managed by the get() and remove() functions.

If there is no value when get() is called, it uses initialValue() to create a new object and store it in the map.

Not sure if this is what you are asking for or if that is even a valid solution when it comes to performance ... I'll just dump those thoughts here.

Guess it might not even work in some languages, because you want to be able to extend ThreadLocal and override initialValue() ... but that means mixing static and dynamic polymorphism?
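For reference, C++11 gets the same per-thread behavior without an explicit map via the thread_local storage class. A minimal sketch (the type and function names are just for illustration):

#include <cstdio>
#include <thread>

struct ScratchBuffer
{
    float data[256];   // per-thread scratch space, never shared
};

// Every thread that touches this variable gets its own separate instance.
thread_local ScratchBuffer scratch;

void work(int id)
{
    scratch.data[0] = static_cast<float>(id);   // no locking needed
    printf("thread %d has its own buffer at %p\n", id, (void*)&scratch);
}

int main()
{
    std::thread a(work, 0), b(work, 1);
    a.join();
    b.join();
}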

Given enough eyeballs, all mysteries are shallow.

MeAndVR

Actually helping the CPU make good use of its caches, as has been pointed out, is pretty hard. It's not often something you actively think about. In terms of game development, the farthest engines tend to go with it is changing how memory is allocated in order to create contiguous blocks. Of course you can stick to more specific optimization techniques; there are quite a few related to the GPU that I'm not experienced enough in to comment on.

Essentially it works the same way as HappyCoder's array example, in that the objects are allocated as arrays and are thus in contiguous blocks of memory. A good way to think of it is similar to how a hard drive reads data: you want to prevent the read head from having to "jump" all over to different blocks just to read a small section.
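A small sketch of the "contiguous blocks" point (the Particle type is made up): storing the objects themselves in one array keeps an update loop walking straight through memory, whereas an array of individually allocated pointers can scatter them across the heap, which is exactly the "jumping read head" problem.

#include <memory>
#include <vector>

struct Particle { float x, y, z, velocity; };

// Cache-friendly: the particles live back to back in a single allocation.
void updatePacked(std::vector<Particle>& particles, float dt)
{
    for (Particle& p : particles)
        p.y += p.velocity * dt;
}

// Cache-unfriendly: each particle is its own heap allocation, so the
// loop may jump to a different region of memory on every iteration.
void updateScattered(std::vector<std::unique_ptr<Particle>>& particles, float dt)
{
    for (auto& p : particles)
        p->y += p->velocity * dt;
}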

Dealing with it on the GPU is a bit of a different story since it basically relates to how you allocate and send data to the video card and the relationship between it and the CPU.

In short, it's not really one broad thing to do, more a bunch of smaller techniques for getting a bit more speed out of sections of code; critical sections obviously benefit more, because a small saving in a single pass is multiplied by every repetition.

Actually helping the CPU make good use of its caches, as has been pointed out, is pretty hard. It's not often something you actively think about.

So, at the level of the OS and the compiler, nothing is done towards this maintenance, except keeping memory accesses as close together as possible?

And is it performance-safe for multiple threads to read from the same memory? If so, why? Is there a point in making every thread have its own memory pool, populated even with duplicated data, or not?

So, at the level of the OS and the compiler, nothing is done towards this maintenance, except keeping memory accesses as close together as possible?

It depends on the compiler, of course, but most of a compiler's optimization repertoire is a lot of little techniques for either reducing instruction counts and redundancies or for reducing cache misses by aligning memory and instructions. There are a lot of techniques, really, and they all come down to how the code is translated, so it really isn't something the programmer themselves deals with.

And is it performance-safe for multiple threads to read from the same memory? If so, why? Is there a point in making every thread have its own memory pool, populated even with duplicated data, or not?

It's performance-debatable at best. Accessing memory from two different cores usually is not helpful for performance, because of two things:

Processor design: processors vary a LOT in how they use cache. Many use separate L1 caches for each core and may share a cache at L2 or L3, but there's nothing saying that HAS to be true. It's certainly possible not to use a shared cache at all; sharing is often done just to save chip space. In the case of each core having its own cache, even if they all accessed the same data they would -all- make their own copy of it, which is very redundant.

Shared cache itself: shared cache can vary in its performance based on the type of operations being performed. Shared cache is usually the L2 or L3, so it is explicitly slower than the L1, and in most cases a core will get the most bang for the buck out of its non-shared cache anyway. Using the shared cache can even be slower than reaching out to RAM, depending on how many cores are accessing a particular cache line and what operations they're doing. In general it's better to have each core working on a different piece of data or instruction in the shared cache anyway.

In terms of overall performance, the biggest real way to gain is to align data in memory and then work on it in cache-line-sized segments. In that case you could theoretically see gains if you, say, took a loop over a set of data and split it into 4 smaller loops. But actually coding for something like this would be a bit of a waste of time: it may be a compiler optimization already, it varies based on the CPU, and it would be meaningless in most cases unless done in extremely performance-critical sections, where the compiler is likely to make a lot of optimizations anyway.
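One concrete case where the alignment part matters is per-core accumulators. A minimal sketch, assuming a 64-byte cache line (common, but not guaranteed on every CPU): padding each counter to a full line keeps two cores from repeatedly bouncing the same line between their L1 caches (false sharing).

#include <cstddef>

// Pad each per-core counter out to an assumed 64-byte cache line.
struct alignas(64) PaddedCounter
{
    long value = 0;
};

// One slot per core. Without alignas(64) the four longs would likely
// share a single cache line, and every increment would force that line
// to ping-pong between the cores' caches.
PaddedCounter counters[4];

void addSample(int core, long amount)
{
    counters[core].value += amount;
}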

This topic is closed to new replies.
