How to cache?



And, is it performance-safe for multiple threads to read from the same memory?

Thread safety for memory is accomplished through memory barriers.

It is one of many issues involved in threaded programming. You must establish a memory barrier before modifying any shared variable. This can be accomplished with interlocked functions, which perform very well, or through more complex locks, which require additional CPU time.
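To make the two options concrete, here is a minimal C++11 sketch (the counter and function names are mine, purely for illustration): the atomic increment maps to an interlocked instruction with full-barrier semantics by default, while the mutex version pays a lock acquire/release round-trip on every write.

#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::atomic<long> interlockedCounter{0};   // lock-free, "interlocked" style
long plainCounter = 0;                     // guarded by counterMutex
std::mutex counterMutex;

void workerInterlocked() {
    for (int i = 0; i < 100000; ++i)
        interlockedCounter.fetch_add(1);   // full-barrier semantics by default
}

void workerLocked() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(counterMutex);
        ++plainCounter;                    // heavier: lock round-trip each time
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back(workerInterlocked);
        threads.emplace_back(workerLocked);
    }
    for (auto& th : threads)
        th.join();
}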


Actually helping the CPU utilize cached instructions, as has been pointed out, is pretty hard. It's not often something you actively think about. In game development, the farthest engines tend to go with it is changing how memory is allocated in order to create contiguous blocks.

I disagree that people writing performance-critical code should not think about cache behavior. Quite the opposite: it can be the greatest bottleneck in a memory-intensive application. At the least, one should know to avoid accessing disparate memory locations, and that cache lines are 64 bytes.
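As a concrete illustration (a made-up example, not from this thread): the two loops below do the same work, but one walks memory in address order and the other jumps across it.

#include <cstddef>

const std::size_t N = 1024;
static float grid[N][N];

// Cache-friendly: visits memory in address order, so every fetched
// 64-byte line is fully used before the loop moves on.
float sumRowMajor() {
    float sum = 0.0f;
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            sum += grid[row][col];
    return sum;
}

// Cache-hostile: jumps N * sizeof(float) bytes between accesses,
// touching a fresh cache line on almost every iteration.
float sumColumnMajor() {
    float sum = 0.0f;
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            sum += grid[row][col];
    return sum;
}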

And, is it performance-safe for multiple threads to read from the same memory? And if so, why? Is there any point in giving every thread its own memory pool, even populated with duplicated data in virtual RAM, or not?

Accessing memory for reading from multiple threads is safe and fast. In rare cases it can even lead to superlinear speedup: one thread can pull a piece of data from main memory into the shared L3 cache, and that data can then be used by another thread. This holds as long as no thread is writing to that memory. Reading the same piece of memory concurrently is safe without any synchronization, because the data cannot change.
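A minimal sketch of that pattern, assuming C++11 threads (the names are invented): two threads sum halves of the same read-only vector with no synchronization at all.

#include <numeric>
#include <thread>
#include <vector>

int main() {
    // Read-only after construction; both threads read it freely.
    const std::vector<int> shared(1 << 20, 1);

    // Distinct result slots, so the two writes below do not race.
    // (They do share a cache line; harmless here, but writing to them
    // repeatedly from both threads would be the classic false-sharing trap.)
    long long partial[2] = {0, 0};

    std::thread a([&] {
        partial[0] = std::accumulate(shared.begin(),
                                     shared.begin() + shared.size() / 2, 0LL);
    });
    std::thread b([&] {
        partial[1] = std::accumulate(shared.begin() + shared.size() / 2,
                                     shared.end(), 0LL);
    });
    a.join();
    b.join();

    long long total = partial[0] + partial[1];
    (void)total;
}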

And, is it performance-safe for multiple threads to read from the same memory?

Thread safety for memory is accomplished through memory barriers.

It is one of many issues involved in threaded programming. You must establish a memory barrier before modifying any shared variable. This can be accomplished with interlocked functions, which perform very well, or through more complex locks, which require additional CPU time.

As per above, this is not true for constant data. If no thread is writing to shared data, then no synchronization is needed on that data.

Apart from sequential memory accesses, there's a complementary way to take advantage of the cache: doing as much computation as possible with the same data before loading other data.

For example, instead of looping (in the best possible sequential fashion) through an array of guided missiles five times: once to update their positions, again to compute collisions, again to compact the array after some missiles have exploded, again to compute their acceleration from the guide system, and again to compute their new speed, looping through them only twice is going to reduce loads and stores, and the ensuing cache misses, by 60%.
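A sketch of the fused update; the Missile fields and helper functions are invented, and for simplicity everything is fused into a single pass (the post above keeps two passes, since collisions may need all positions updated first).

#include <cstddef>
#include <vector>

struct Missile {
    float pos[3];
    float vel[3];
};

// Stub helpers so the sketch compiles; real versions would query the
// guidance system and the collision world.
static float guideSystemAccel(const Missile&, int /*axis*/) { return 0.0f; }
static bool  checkCollision(const Missile&)                 { return false; }

// One fused pass: each Missile is pulled into cache once, and guidance,
// velocity, position, collision, and compaction all happen while its
// data is still hot.
static void updateMissiles(std::vector<Missile>& missiles, float dt) {
    std::size_t alive = 0;
    for (std::size_t i = 0; i < missiles.size(); ++i) {
        Missile m = missiles[i];
        for (int k = 0; k < 3; ++k) {
            m.vel[k] += guideSystemAccel(m, k) * dt;  // acceleration from the guide system
            m.pos[k] += m.vel[k] * dt;                // integrate speed and position
        }
        if (!checkCollision(m))
            missiles[alive++] = m;                    // compact survivors in place
    }
    missiles.resize(alive);
}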

Omae Wa Mou Shindeiru

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at GoingNative on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.
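For example (the fields are invented), the idea looks like this:

// Hot data first: everything the per-frame update touches sits at the
// front of the object.
struct Entity {
    // hot: read every frame (32 bytes total)
    float position[3];
    float velocity[3];
    float boundingRadius;
    unsigned flags;
    // cold: touched rarely
    char debugName[64];
    int  editorOnlyId;
};

The hot fields here total 32 bytes, comfortably inside one line (provided the object doesn't straddle a line boundary, which is what the alignment discussion further down is about).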

I had read somewhere that current systems have not one but several hardware prefetchers, and that you can (if I remember correctly) even experiment with disabling or enabling them (probably in the BIOS) to check their influence on performance. Does anyone here know something about this and could say more? (Maybe check whether you have something like that in your BIOS? I've got an old machine.)

On the GPU it's hard to take any concrete action at the very low level. For the most part, you can't rely on any set of GPUs behaving in the same way WRT caching, because each vendor has their own secret sauce. Newer GPUs are becoming more homogenized across generations, and even across vendors, as GPU compute becomes more and more a factor in their hardware designs, but there's still quite a bit of variance today. I predict they will continue to trend closer and closer, but I doubt we'll ever see the kind of homogeneity that exists between, say, AMD and Intel processors.

What you can do instead is be less worried about specific caching behaviors and more worried about the capabilities defined by the different Direct3D versions -- D3D specifies the number of instruction slots, registers, and the amount of shared memory available, among other things. You also need to be aware of how GPU execution differs (e.g. a branch executes both sides if even one element goes in each direction, with results masked and merged -- and this grows exponentially with nested branches). Optimal GPU code loads some memory that no other thread-group wants to touch at the same time, parties on it for a bit, writes back the result, and moves on to the next bit of memory.

throw table_exception("(╯°□°)╯︵ ┻━┻");

As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.
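For instance, a loop like this (my own illustration) reads two streams and writes a third, all sequentially; typical hardware prefetchers track each stream independently and stay ahead of the loop:

// Three sequential streams: a and b are read forward, out is written
// forward. The hardware prefetcher tracks each stream independently,
// so the loop runs at close to memory bandwidth.
void addArrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}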

A couple of days ago there was a topic on that; I was asking mainly about two kinds of prefetching, which can be expressed with a simple example:

for (int i = 0; i < N; i++)
    A[i*4] = B[i*3];

1) many interleaved array accesses to prefetch - will this work?

2) prefetching by stride - will this work?

I found some hints over the net which say:

1) It seems that many interleaved streams can be prefetched (I do not know if there is some limit; I suspect there is one).

2) Prefetchers can follow a stride - probably in a strided way, that is, they are able to get some bytes, skip a gap, get some bytes, skip a gap, not just fetch all contiguous memory. (That strided access was in the suggestions I was reading back then - but if a cache line is 64 bytes, it would seem they will fetch at least 64 bytes at a time; I don't know.)

[If so, thinking about the examples in that previous topic can bring this outcome: when scanning a full table of used and unused records, the prefetcher will prefetch it all - that is more cache traffic (used and unused records), but no cache misses. In the second example, reaching only for the used records, the cache traffic would be much smaller, but I think the prefetcher could not work here - if so, I could suspect that the first approach can be better; see the sketch below.]

- They cannot work when the stride is big, around 1000 or so, and in such a case there would be a stall.

That's what I found - it is, in my opinion, important knowledge to consider if someone is interested in optimization.
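To restate that used/unused-records trade-off as code, here is a minimal sketch; the Record layout, the function names, and the index array are my own invention, not from the earlier topic:

#include <cstddef>

struct Record {
    bool  used;
    float payload[15];  // 1 + padding + 60 bytes: one 64-byte line per record
};

// Option 1: scan the whole table, used and unused records alike.
// More bytes streamed, but the access pattern is purely sequential,
// so the prefetcher hides the latency: traffic, but no misses.
float sumAllThenFilter(const Record* table, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        if (table[i].used)
            sum += table[i].payload[0];
    return sum;
}

// Option 2: follow an index of used records only. Far fewer bytes
// touched, but if the used records are scattered, each access is a
// potential cache miss the prefetcher cannot predict.
float sumViaIndex(const Record* table,
                  const std::size_t* usedIndex, std::size_t usedCount) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < usedCount; ++i)
        sum += table[usedIndex[i]].payload[0];
    return sum;
}

Which one wins depends on the fraction of used records and how scattered they are; the sequential scan often wins even when it reads far more bytes, precisely because the prefetcher can stream it.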

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at GoingNative on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.

I also wonder whether they shouldn't sometimes be aligned to 64 bytes as well. Is the 64-byte cache line that gets loaded an aligned one, or just an unaligned 64-byte-long block?

If we're giving out general cache performance tips, here's a good one from Andrei Alexandrescu, given at GoingNative on Wednesday: organize your class data members by hotness, so that the hottest data is within the first 64 bytes of the class.


I also wonder whether they shouldn't sometimes be aligned to 64 bytes as well. Is the 64-byte cache line that gets loaded an aligned one, or just an unaligned 64-byte-long block?

Yes, it does need to be 64-byte aligned. So you want padding after large classes to make their size divisible by 64 bytes, so that array elements stay line-aligned.
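In C++11 this can be written down directly; a minimal sketch (type and field names invented):

// alignas(64) starts every instance on a cache-line boundary, and the
// static_assert documents that the size is a multiple of 64, so each
// element of an array of these also starts on a line boundary.
struct alignas(64) HotObject {
    float position[3];
    float velocity[3];
    unsigned flags;
    // the compiler pads the rest of the 64 bytes
};
static_assert(sizeof(HotObject) % 64 == 0,
              "array elements must stay cache-line aligned");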


Yes, it does need to be 64-byte aligned. So you want padding after large classes to make their size divisible by 64 bytes, so that array elements stay line-aligned.

Does someone maybe know: if the CPU reads some byte (and it goes through the cache), does it always fetch the whole aligned 64-byte line containing it (that is, some bytes after and usually some bytes before the address)? In other words, when I read some byte, may I assume that the whole aligned line around it is in the cache?
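For what it's worth, on mainstream x86/x64 CPUs the answer is yes: lines are aligned, so reading any single byte fills the entire aligned 64-byte line that contains it, including the bytes before the address. A small sketch (assuming 64-byte lines) of computing which line a given byte lands in:

#include <cstdint>
#include <cstdio>

const std::uintptr_t kLineSize = 64;  // assumed; 64 bytes on current x86/x64

// Round an address down to the start of its (aligned) cache line.
void printLineSpan(const void* p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t lineStart = addr & ~(kLineSize - 1);
    std::printf("byte at %#llx lives in line [%#llx, %#llx)\n",
                (unsigned long long)addr,
                (unsigned long long)lineStart,
                (unsigned long long)(lineStart + kLineSize));
}

int main() {
    char buffer[256];
    printLineSpan(&buffer[100]);  // reading buffer[100] loads this whole line
}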

