I've read in several places at this point that to get the most out of your CPU's cache, it's important to pack related data together and access it linearly. That way the first access to a region of memory loads it into a cache line, and the remaining accesses to that line are much cheaper. One thing that isn't clear to me, however, is how many cache lines you can have "active" at once.
So for example, if you have a series of 3D vectors, and you lay them out like this:
[xxxx...]
[yyyy...]
[zzzz...]
And then you access your data as:
for (std::size_t i = 0; i < len; ++i) {
    auto x_i = x[i];
    auto y_i = y[i];
    auto z_i = z[i];
    // Do something with x, y, and z
}
Does each array get its own cache line? Or does accessing the 'y' element push 'x' out of the cache, and then accessing the 'z' element push 'y' out of the cache? If you were to iterate backwards, would that cause more cache misses than iterating forwards?
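To make the forwards/backwards question concrete, here is a rough sketch of the comparison I have in mind. It assumes the components are floats stored in three std::vector<float>; sum_soa and time_us are names I made up for illustration, and wall-clock timing only hints at cache behaviour, it doesn't actually count misses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds up x[i] + y[i] + z[i] for every element,
// walking the arrays either forwards or backwards. The sum is returned
// so the loop has an observable result.
float sum_soa(const std::vector<float>& x, const std::vector<float>& y,
              const std::vector<float>& z, bool forwards) {
    const std::size_t len = x.size();
    assert(y.size() == len && z.size() == len);  // all three arrays must match
    float total = 0.0f;
    for (std::size_t n = 0; n < len; ++n) {
        // Same work in both directions; only the visiting order differs.
        const std::size_t i = forwards ? n : len - 1 - n;
        total += x[i] + y[i] + z[i];
    }
    return total;
}

// Hypothetical helper: wall-clock duration of one call to f, in microseconds.
template <typename F>
long long time_us(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

The idea would be to call time_us([&] { total = sum_soa(x, y, z, true); }) and then again with false, on arrays much larger than the cache, and compare the two numbers over several runs.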
On another note, while I try to follow best practices for this stuff where possible, I literally have zero idea how effective (or ineffective) it is, since I have no tools for profiling it, and I don't have time to write one version of something and then test it against another. Are there any free (or free for students) tools for cache profiling on Windows? I'd love to use Valgrind, but I don't have anything that can run Linux that is also powerful enough to run my game.
Thanks!