Hi.
I have trouble understanding how caching is implemented. I understand all the low-level aspects of digital computation, yet this is totally out of my scope. Is there a way to implement a cache outside of CPU management? Say there is a part of memory that is going to be accessed for reading; how would I cache it? The question is not whether it would benefit me; the question is why it would not benefit me.
Caching means storing something when it's accessed, so that next time the same thing is used, you don't have to fetch it again.
The CPU cache caches data that's located in main memory on the CPU itself. That way, accessing the same memory location twice in a row, or near to it, is faster.
The GPU cache is the same, for the GPU.
You can implement a cache in memory, but it wouldn't gain you anything, because accessing the cache would be just as slow as accessing the thing being cached. You could of course implement a cache of something other than memory, and that might be helpful; for example, your web browser has a cache for visited webpages.
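That kind of software cache can be sketched in a few lines. Here is a minimal, hypothetical Python example that caches the results of a slow fetch; `slow_fetch` is a made-up stand-in for any expensive source (disk, network, a web request):

```python
def slow_fetch(key):
    # Stand-in for an expensive operation (disk read, network call, ...).
    return key * 2

cache = {}

def cached_fetch(key):
    if key not in cache:              # cache miss: do the slow work once
        cache[key] = slow_fetch(key)
    return cache[key]                 # cache hit: answered from the dict

print(cached_fetch(21))  # miss: computes and stores 42
print(cached_fetch(21))  # hit: served straight from the cache
```

This only pays off because the dictionary lookup is much cheaper than `slow_fetch`; a cache of plain memory reads, as noted above, would not be.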
My more precise problem is the GPU cache. There are multiple SIMD (single instruction, multiple data) threads running on my side, but they often access the same memory to read from. This causes huge stalls when the same memory is accessed by several SIMD threads. So my understanding is that the cache is a private memory of each individual SIMD thread that gets populated with the data that thread is likely to access. Is that true? My problem is how to make sure that every SIMD thread reads from its own spot in memory. The only way I can distinguish between SIMD threads is the index of the pixel they are computing, which differs for every SIMD thread, running or about to run. My only idea for implementing a cache for those threads would be:
cache = new [numOfThreads * cacheSize]  // <= data likely to be read by all parallel threads
                                        //    (say 80 threads, that means 80 identical copies)
thread
{
    .........
    var some = cache[threadID][particularData];
    ........
}
Is this how it works?
Thanks a lot for any clarifications!
So in the GPU there is actually not one cache but two (though this may vary between GPUs). The L1 cache is per core, and it is in turn backed by an L2 cache shared across the processor. When two cores want to access the same memory address at the same time, and one of the accesses is a write, the GPU has to do work to ensure that they both see a consistent view of memory. So it must propagate writes to the L2 cache. And when a core attempts to access a memory location, it must check whether the value has been changed in the L2 cache. If it has, the core has to fetch the new value from L2; if it hasn't, the core can use the value in its L1 cache. That's why multiple cores writing to the same memory location is slower than when they don't share any data.
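A toy model of that propagation might look like the sketch below. This is a deliberately simplified picture, not how real hardware is wired; the class names and the write-through-plus-invalidate policy are assumptions chosen just to show why shared writes force the slow L2 path:

```python
# Toy model: two cores with private L1 dictionaries over a shared L2.
# A write goes through to L2 and invalidates the other core's L1 line,
# so that core's next read must refetch from L2 (the slow path).

class L2Cache:
    def __init__(self):
        self.mem = {}

class Core:
    def __init__(self, l2):
        self.l1 = {}
        self.l2 = l2
        self.peers = []          # other cores to invalidate on write

    def write(self, addr, value):
        self.l1[addr] = value
        self.l2.mem[addr] = value        # propagate the write to L2
        for p in self.peers:
            p.l1.pop(addr, None)         # invalidate stale L1 copies

    def read(self, addr):
        if addr in self.l1:              # fast path: L1 hit
            return self.l1[addr]
        value = self.l2.mem[addr]        # slow path: refetch from L2
        self.l1[addr] = value
        return value

l2 = L2Cache()
core_a, core_b = Core(l2), Core(l2)
core_a.peers.append(core_b)
core_b.peers.append(core_a)

core_a.write(0x10, 1)
core_b.read(0x10)          # core_b now holds 1 in its L1
core_a.write(0x10, 2)      # invalidates core_b's L1 copy
print(core_b.read(0x10))   # core_b must go back to L2: prints 2
```

If `core_a` never wrote to `0x10`, `core_b` would keep hitting its L1 copy, which is the cheap case described above.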
The CPU works the same way, except there are generally 3 levels of cache.
Note however that this only applies when at least one thread is writing to memory. Having multiple cores read from the same location is not a problem.
So the best way to ensure that the L1 cache is used efficiently is not to share mutable data between multiple cores. I presume you could either partition the data smartly between cores, or buffer the data so that a texture is not being written to and read from at the same time.
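Partitioning can be as simple as giving each core a disjoint slice of the output so that no two cores ever write to the same region. A rough sketch, where the chunking scheme is just one possible choice:

```python
# Split work so each worker writes only to its own disjoint slice.
# No two workers touch the same output cells, so there is no
# write-sharing and no coherence traffic between their caches.

def partition(n_items, n_workers):
    """Return (start, end) ranges, one per worker, covering 0..n_items."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

output = [0] * 10
for worker_id, (start, end) in enumerate(partition(10, 3)):
    for i in range(start, end):      # each worker owns output[start:end]
        output[i] = worker_id

print(output)  # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Contiguous slices also help each core's L1 for reads, since neighbouring elements land on the same cache lines.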