Understanding Cache Utilization

29 comments, last by Khatharr 10 years, 7 months ago

The canonical example of cache utilization problems in OOP seems to be the "for all entities in list, if entity is alive, do something with it" problem.

Code from the third link is here.

Ex. 1:


struct object_def {
    char status;
    double object_var1, object_var2, object_var3, object_var4, object_var5;
};

struct object_def object_vector[Size];

void functionA()
{
    int i;
    for (i = 0; i < Size; i++) {
        if (object_vector[i].status)
            do_something_with_object(object_vector[i]);
    }
}

Ex. 2:


struct object_def {
    double object_var1, object_var2, object_var3, object_var4, object_var5;
};

struct object_def object_vector[Size];
char object_vector_status[Size];

void functionA()
{
    int i;
    for (i = 0; i < Size; i++) {
        if (object_vector_status[i])
            do_something_with_object(object_vector_status[i], object_vector[i]);
    }
}

Supposedly, the second example exhibits better cache locality than the first. I don't really understand why.

If the object is active, and you need to work with it, the rest of the data for that object (or at least some of it) is already in the cache. If the object is not alive, you flush the cache and carry on.

With the second example, it's inverted. Since your cache line now consists of all your "isAlive" values, every time you find a live object, you have to flush the cache, whereas if the object is not alive, you can carry on.

I don't see any appreciable reason to use one versus the other, unless you have pre-existing knowledge of the "aliveness distribution" of your objects. Do these examples just assume that knowledge, or am I missing something about caches here?


It's something like this:

In the first case dead objects are still loaded into the cache. For large objects this can be time consuming.

In the second case dead objects are not loaded into the cache.

In both cases, if the object is alive you have to load it into the cache if it isn't already there.

The real benefit, therefore, is in how dead objects are handled: in case 1 you load them (the worst-case scenario), and in case 2 you don't.

In the first example you'll always load objects into the cache, which can result in a high number of cache misses (due to objects being dead). In the second example, the number of cache misses will most likely be lower. Also note that your object is smaller in the second example, which allows more objects to be cached, resulting in fewer cache misses.

Let's say I have a 64B cache line, and my "deadness" variable fits in a byte. Let's also assume each object is 63B without the deadness variable (or exactly one cache line with it). I have 64 objects.

That means, in case 2, all 64 deadness variables fit in one cache line.

I check the first variable, it's "1", so the cache gets flushed, and the object data gets loaded in. If every object is alive, that's exactly 64 cache misses for my dataset.

In the prior case, let's say I can fit one object (including the deadness variable) in a single 64B cache line.

In case 1, I load the object. If it's alive, I operate on it, load the next object, and so on.

In case 2, I load the alive dataset. If the first one is alive, I flush, put the 63B object in, operate, flush, load the alive dataset again, repeat.

If every object is alive, that's 64 unnecessary cache misses with the second case.

If every object is dead, it's of course 64 unnecessary cache misses with the first case.

I'm pretty sure that in a 50/50 situation, the two approaches break even in terms of cache misses. What's more, as the items get smaller, the first approach ought to pull ahead. Suppose 16 data items (including the flag) fit on a cache line. Now you only have to flush every 16 items even in the worst case, rather than on every single item in the other approach.

Am I idealizing the data here too much?

Am I idealizing the data here too much?

Edit: This information is incorrect. I'm keeping it struck through to demonstrate the wrong way of thinking, if it's of any use.

Yes.

You're correct, in that if your objects are small already there's really no advantage to this approach.

However, suppose your objects are large -- pathologically large to make the point. Let's say your objects are 0.1 MB apiece and your cache is 2 MB. Suppose you have 1000 of these objects. Your cache can hold at most 20 of these objects. That means you'll be thrashing the cache when you loop through your 1000 objects, because they'll all have to be loaded to check if they should be used or not.

In the second approach at the cost of 1000 bytes in your cache, you'll know exactly which ones to load. The advantage is more noticeable, obviously, if you only need to load a subset of those 1000 objects.

You're correct, in that if your objects are small already there's really no advantage to this approach.

The original example had 5 doubles and a char (48 bytes after padding in VS's compiler), so I guess that's just a bad example.

I'm also not taking associativity into account, since I'm not too familiar with its practical effects. From what little I know about associative caches, though, it's probably not a good idea to ignore.

I check the first variable, it's "1", so the cache gets flushed....
In case 2, I load the alive dataset. If the first one is alive, I flush, put the 63B object in, operate, flush, load the alive dataset again, repeat.

Are you implying that when we load data from the object array, all the data from the status array is purged from the cache?
If so, there's where you've gone wrong.
The cache is much, much bigger than one line. Hopefully, the status data would still be cached by the time that you've finished operating on an object.

With the first case (and your 64B total object size - object and flag - and the assumption that we're using 64B-aligned addresses), every single iteration is using data that hasn't existed in the cache before, regardless of whether the object is used or not. Every iteration you have to move 64B from memory into the cache.

With the second case (let's say the flag is 1B and the object is 64B for simplicity), you need to move 64B of 'status' data from RAM into the cache once per 64 iterations.
If the status is true for an iteration, then you also have to move 64B of 'object' data to the cache.

So #1 is one load per iteration, while #2 is one load per 64 iterations plus one load per active iteration.
Assuming that the array is mostly inactive, then #2 will win.
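To put rough numbers on that (same assumptions as above: 1B flags, 64B objects, 64B lines): with 4096 objects of which 10% are active, case #1 costs 4096 line loads, while case #2 costs 4096/64 = 64 status loads plus about 410 object loads, roughly an 8-9x reduction in memory traffic.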

To take it to the extreme, you should also be using bits for the status flag instead of bytes, which would let you store 512 status flags in one cache line.
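For illustration, a minimal sketch of that bit-packing, reusing Size from the examples above (the helper names here are made up):

#include <stdint.h>

/* one bit per object: a 64B cache line holds 8 x 64-bit words = 512 flags */
static uint64_t status_bits[(Size + 63) / 64];

static inline int is_alive(int i)
{
    return (status_bits[i / 64] >> (i % 64)) & 1;
}

static inline void set_alive(int i, int alive)
{
    uint64_t mask = (uint64_t)1 << (i % 64);
    if (alive) status_bits[i / 64] |= mask;
    else       status_bits[i / 64] &= ~mask;
}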


Now you've also got to take prefetching into account. If you load 2 lines from contiguous areas in RAM (e.g. at address #0, then address #64), then the prefetching unit can make a guess that you'll probably soon want to be using more contiguous data too (e.g. so it will begin fetching a line from #128 before you even ask for it).

In case #1, this is great, because it means that while we're operating on object #4, the cache will already be downloading the data for object #5 in the background!
So, by the time we get to object #5, there will be no cache miss...
...assuming that we spent enough time processing object #4, that is!

While we're processing object #4, the cache prefetcher decides to start fetching the data for object #5, but this fetch takes time.
If we spend a few hundred CPU cycles on object #4, then this fetch will be complete, and there will be no stalling at all, no cache misses.
However, if we spend 10 CPU cycles on object #4 (e.g. checking a status flag, and then continuing), then the fetch won't be complete, the CPU will stall, waiting for the cache! We end up sitting idle for a few hundred cycles, and wasting a similar amount of time as if we actually processed the inactive object anyway!

So, for cases where the objects are mostly inactive, then case #1 is extremely bad.

On the other hand, case #2 does extremely well:
We fetch 512 status flags in one go, we find that they're all false, and do no more memory fetching!

In cases where there's a few contiguous blocks of active/inactive data (e.g. 0-12 are inactive, 13-40 are active, and 41-63 are inactive), then #2 again does very well.
When it starts working on #13, there's a stall as that object is fetched, but then while it's processing #13, the prefetcher can be grabbing #14's data in the background. It continues like this, without having any memory stalling for 14-40 (assuming that the 'operation' that it's doing on the object consumes a few hundred cycles of CPU time per object -- enough to cover the fetch time).


In cases where there's a completely random distribution of active/inactive objects, then both cases will perform quite badly, as the prefetcher will be useless.
Again though, if there's more inactive than active, then case #2 will probably still win.

The real solution for those situations would be to sort the data ahead of time, so that all the active objects are stored contiguously, and all the inactive ones elsewhere. If you do that, then instead of having active flags, you could instead have a start/end iterator describing the active range too, which is much neater ;)

i.e. design the code with linear data access patterns in mind.
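A minimal sketch of that sorted/partitioned layout, reusing the struct and function names from the original post (the swap-with-last removal is just one common way to keep the live range contiguous):

struct object_def object_vector[Size];
int live_count;  /* objects [0, live_count) are alive */

void kill_object(int i)
{
    /* swap the dead object with the last live one and shrink the range;
       order isn't preserved, but the live objects stay contiguous */
    object_vector[i] = object_vector[live_count - 1];
    live_count--;
}

void functionA(void)
{
    /* purely linear walk: no status flag to test, no dead objects touched,
       and the prefetcher sees a single sequential stream */
    int i;
    for (i = 0; i < live_count; i++)
        do_something_with_object(object_vector[i]);
}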

A few things...

If all of your objects are always alive then there's no point in having a variable that marks them as such, and the argument is moot. You'll only get a performance benefit when the average case is that only a subset of the objects is alive. One such application is loading graphics resources: a game can compute your distance to each graphical object, and if one is too far away the program decides not to render it and marks it as dead. That way your cache is never loaded with a graphics object it won't use (a sketch of this follows at the end of this post).

Even seemingly small objects can impact cache performance. It's better to think in terms of size of object versus remaining space in the cache. So the example you list could even affect performance depending on the situation.

The cost of fully associative vs. set associative caching is usually a burden of circuit paths in hardware. Fully associative caches are theoretically ideal. However, the complexity of the circuit logic can increase your access time and limit the maximum cache frequency (it will definitely increase your power draw, an important aspect in mobile designs). Set associative is usually the way to go for the big caches.
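Going back to the graphics-resource idea above, a minimal sketch of that distance cull feeding the status array from Ex. 2 (the camera parameters and the pos_x/pos_y/pos_z position arrays are hypothetical):

float pos_x[Size], pos_y[Size], pos_z[Size];  /* hypothetical position data */

void cull_by_distance(float cam_x, float cam_y, float cam_z, float cutoff)
{
    float cutoff2 = cutoff * cutoff;
    int i;
    for (i = 0; i < Size; i++) {
        float dx = pos_x[i] - cam_x;
        float dy = pos_y[i] - cam_y;
        float dz = pos_z[i] - cam_z;
        /* compare squared distances so we avoid a sqrt per object */
        object_vector_status[i] = (dx*dx + dy*dy + dz*dz) <= cutoff2;
    }
}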

*stuff*

I've heard about the sorting stuff before, and it makes sense. The caching itself didn't make sense though, because I didn't take into account the relationship between a cache line and the cache itself (which I should have, given that I've taken an architecture class). That's supremely helpful. Thanks!

If all of your objects are always alive then there's no point in having a variable that marks them as such, and the argument is moot.

What I was trying to bring up (under my one-cache-line assumption, which Hodgman has already corrected) is that whichever way the objects swing past 50/50 determines the superior layout: more alive favors type 1, more dead favors type 2.



Supposedly, the second example exhibits better cache locality than the first. I don't really understand why.

How do you measure which version actually gets better cache locality?

Alignment may also come into play: I'm not sure whether the compiler inserts padding bytes (seven, to align the first double) after the leading char in example one, or not.

As to efficiency, I don't know. Example 1 forces a sequential read of all the table records, while example 2 forces a selective read (into the cache) of only some of the records.

AFAIK a sequential table read is faster than a selective (random-like) one, but on the other hand the selective read gathers less trash into the cache.

(How it works out depends on information you don't give here: how big the table is, how you measure the effect (whether it's a one-pass read or inside a loop - this is crucial), how many records are alive, how big the cache is, and maybe also whether the do_something calls touch any other tables.)

(I would guess that if only a few records are off, example 1 would be noticeably faster because of the sequential table read - though I'm not quite sure whether forward prefetching works as well when reading a table with small jumps of a couple of bytes ahead, or whether it needs a solid read-ahead.)

If many records are off, you get random-like access to the records in the array, which can be much slower (on an old Pentium 4 I measured it some time ago at about 30x slower; newer systems are probably better, I'm not sure), but you pollute the cache less, as I said (though that should only matter if some loop is cache-size constrained, i.e. if the working set of the loop is smaller than the cache, if I'm thinking correctly).

The other thing I don't know (and I'd ask anyone who does): a sequential one-pass read of a big table is quite fast - the forward prefetching works well, and you get nearly cache speed rather than RAM speed even for very large arrays. But does it still work when you interleave that sequential read with reads from another table? I mean:

for (int i = 0; i < 100*1000*1000; i++)
    sum += table[i];              // this is cache-fast

for (int i = 0; i < 100*1000*1000; i++) {
    sum += table[i];
    sum2 += other_table[i];       // will these reads kill the prefetcher of the first table?
}

I ask because if a CPU prefetcher works 'locally'/'temporally' (i.e. it looks at the last stretch of RAM that was read and fetches what follows it), it could be defeated here, since there are two interleaved access streams. A good prefetcher would have to remember all the sequential 'fronts' of reading (of which there could be not just two but, say, ten interleaved sequential table reads), and such a prefetcher is probably harder to implement, so it could fail to prefetch some of them.
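A minimal, self-contained sketch for measuring exactly that question, timing one sequential stream against two interleaved ones (the array sizes and the clock()-based timing are just assumptions for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (100 * 1000 * 1000)

int main(void)
{
    char *table = malloc(N);
    char *other_table = malloc(N);
    long sum = 0, sum2 = 0;
    clock_t t0, t1, t2;
    int i;

    if (!table || !other_table) return 1;

    /* touch the memory first so the OS actually maps the pages */
    for (i = 0; i < N; i++) { table[i] = 1; other_table[i] = 2; }

    t0 = clock();
    for (i = 0; i < N; i++)
        sum += table[i];                    /* one sequential stream */
    t1 = clock();
    for (i = 0; i < N; i++) {
        sum += table[i];                    /* two interleaved streams */
        sum2 += other_table[i];
    }
    t2 = clock();

    printf("one stream:  %.0f ms\n", (t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
    printf("two streams: %.0f ms\n", (t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
    printf("checksums: %ld %ld\n", sum, sum2); /* keep the loops from being optimized away */
    free(table);
    free(other_table);
    return 0;
}

If the second loop takes roughly twice the first (it reads twice the data), the prefetcher is keeping up with both streams; a much larger ratio would suggest the interleaving is defeating it.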

I ran some tests a while ago on an old Pentium 4 machine, with code like this:

void test()
{
    for (int k = 0; k < 10*1024*1024; k++)
        mem[lut[k]]++;
}

This increments 10M chars; lut is a lookup table.

When lut was initialised sequentially, the execution time was 50 ms.

When it was initialised randomly, the time was 3000 ms - that is, 60x worse.

If somebody would like to test it on other machines, I would like to see the results too.
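For anyone who wants to reproduce this, here is a self-contained version of the test; the Fisher-Yates shuffle for the random case is an assumption about how the original lut was initialised:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (10 * 1024 * 1024)

static unsigned char mem[N];
static int lut[N];

int main(int argc, char **argv)
{
    int k;
    clock_t t0, t1;
    int randomize = (argc > 1);  /* pass any argument for the random case */

    for (k = 0; k < N; k++) lut[k] = k;   /* sequential initialisation */
    if (randomize) {
        /* Fisher-Yates shuffle for the random-access case */
        for (k = N - 1; k > 0; k--) {
            int j = rand() % (k + 1);
            int t = lut[k]; lut[k] = lut[j]; lut[j] = t;
        }
    }

    t0 = clock();
    for (k = 0; k < N; k++)
        mem[lut[k]]++;
    t1 = clock();

    printf("%s: %.0f ms (checksum %d)\n",
           randomize ? "random" : "sequential",
           (t1 - t0) * 1000.0 / CLOCKS_PER_SEC, mem[0]);
    return 0;
}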

This topic is closed to new replies.
