Jump to content
  • Advertisement
Sign in to follow this  
SeraphLance

Understanding Cache Utilization

This topic is 1847 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

The canonical example of cache utilization problems in OOP seems to be the "for all entities in list, if entity is alive, do something with it" problem.

 

Code from the third link is here.

 

Ex. 1:

struct object_def {
    char status;
    double object_var1, object_var2, object_var3, object_var4, object_var5 ;
};
 
struct object_def object_vector[Size];
 
void functionA()
{
    int i;
    for (i=0; i < Size; i++) {
        if (object_vector[i].status)
           do_something_with_object(object_vector[i]);
    }
}

Ex. 2:

struct object_def {
     double object_var1, object_var2, object_var3, object_var4, object_var5 ;
};
 
struct object_def object_vector[Size];
char object_vector_status[Size];
 
void functionA()
{
    int i;
    for ( i=0; i<Size; i++) {
        if (object_vector_status[i])
           do_something_with_object(status[i],object_vector[i]);
    }
}

Supposedly, the second example exhibits better cache locality than the first.  I don't really understand why.

 

If the object is active, and you need to work with it, the rest of the data for that object (or at least some of it) is already in the cache.  If the object is not alive, you flush the cache and carry on.

 

With the second example, it's inverted.  Since your cache line now consists of all your "isAlive" values, every time you find a live object, you have to flush the cache, whereas if the object is not alive, you can carry on.

 

I don't see any appreciable reason to use one versus the other, unless you have pre-existing knowledge of the "aliveness distribution" of your objects.  Do these examples just assume that knowledge, or am I missing something about caches here?

Edited by SeraphLance

Share this post


Link to post
Share on other sites
Advertisement

It's something like this:

 

In the first case dead objects are still loaded into the cache.  For large objects this can be time consuming.

 

In the second case dead objects are not loaded into the cache.

 

In both cases, if the object is alive you have to load the cache with the object if it's not already there.  Otherwise it's already in the cache.

 

The real benefit, therefore, is in the case of checking against dead objects.  You load dead objects in case 1 and don't load them in case 2 (as the worst case scenario).

Share this post


Link to post
Share on other sites

In the first example you'll always cache, which can result in a high amount of cache misses(due to objects being dead). In the second example, the amount of cache misses will most likely be less. Also note that your object size is smaller in the second example, which allows for more objects to be cached, resulting into less cache misses.

Share this post


Link to post
Share on other sites

Let's say I have a 64B cache line, and my "deadness" variable fits in a byte.  Let's also assume each object is 63B without the deadness variable(or exactly one cache line with it).  I have 64 objects.

 

which means, in case 2, all 64 deadness variables fit in the cache line.

 

I check the first variable, it's "1", so the cache gets flushed, and the object data gets loaded in.  If every object is alive, that's exactly 64 cache misses for my dataset.

 

In the prior case, let's say I can fit one object (jncluding the deadness variable) in a single 64B cache line.

 

In case 1, I load the object.  If it's alive, I operate on it, load the next object, et. al.

 

In case 2, I load the alive dataset.  If the first one is alive, I flush, put the 63B object in, operate, flush, load the alive dataset again, repeat.

 

If every object is alive, that's 64 unnecessary cache misses with the second case.

 

If every object is dead, it's of course 64 unnecessary cache misses with the first case.

 

I'm pretty sure in a 50/50 situation, the two approaches break even in terms of cache misses.  What's more, as the data set shrinks, the first approach ought to pull ahead.  Suppose 16 data items (including the flag) fit on a cache line.  Now you only have to flush every 16 in even the worst case, rather than every single item in the other approach.

 

Am I idealizing the data here too much?

Edited by SeraphLance

Share this post


Link to post
Share on other sites

Am I idealizing the data here too much?

 

Edit:  This is my own edit.  This information is incorrect.  I'm keeping the strikethrough to demonstrate the wrong way of thinking, if it's of any use.

 

Yes. 

 

You're correct, in that if your objects are small already there's really no advantage to this approach.

 

However, suppose your objects are large -- pathologically large to make the point.  Let's say your objects are 0.1 MB apiece and your cache is 2 MB.  Suppose you have 1000 of these objects.  Your cache can hold at most 20 of these objects.  That means you'll be thrashing the cache when you loop through your 1000 objects, because they'll all have to be loaded to check if they should be used or not.

 

In the second approach at the cost of 1000 bytes in your cache, you'll know exactly which ones to load.  The advantage is more noticeable, obviously, if you only need to load a subset of those 1000 objects.

 

Edited by Cosmic314

Share this post


Link to post
Share on other sites

 

Am I idealizing the data here too much?

 

Yes. 

 

You're correct, in that if your objects are small already there's really no advantage to this approach.

 

However, suppose your objects are large -- pathologically large to make the point.  Let's say your objects are 0.1 MB apiece and your cache is 2 MB.  Suppose you have 1000 of these objects.  Your cache can hold at most 20 of these objects.  That means you'll be thrashing the cache when you loop through your 1000 objects, because they'll all have to be loaded to check if they should be used or not.

 

In the second approach at the cost of 1000 bytes in your cache, you'll know exactly which ones to load.  The advantage is more noticeable, obviously, if you only need to load a subset of those 1000 objects.

 

 

The original example had 5 doubles and a char (41-44 bytes in VS's compiler), so I guess that's just a bad example.

 

I'm also not taking associativity into account, since I'm not too familiar with its practical effects.  From what little I know about associative caches though, that's probably not a good idea to ignore it.

Edited by SeraphLance

Share this post


Link to post
Share on other sites

A few things...

 

If all of your objects are always alive then there's no point to have a variable that indicates them as such and the argument is moot.  You'll only get a performance benefit if the average usage case is when a subset of the objects are alive.  Such an application might be for loading graphics resources.  Maybe a game can determine your distance to each graphical object.  If it's too far away your program decides not to render it and it marks it as dead.  That way your cache will never be loaded with a graphics object it won't use.

 

Even seemingly small objects can impact cache performance.  It's better to think in terms of size of object versus remaining space in the cache.  So the example you list could even affect performance depending on the situation.

 

The cost of fully associative vs. set associative caching is usually a burden of circuit paths in hardware.  Fully associative are theoretically ideal.  However, the complexity of the circuit logic could increase your access time and limit the maximum cache frequency (it will definitely increase your power, an important aspect in mobile designs).  Set associative is usually the way to go on the big caches.

Share this post


Link to post
Share on other sites

 

*stuff*

 

I've heard about the sorting stuff before, and it makes sense.  The caching itself didn't make sense though, because I didn't take into account the relationship between a cache line and the cache itself (which I should have, given that I've taken an architecture class).  That's supremely helpful.  Thanks!

 

 

A few things...

 

If all of your objects are always alive then there's no point to have a variable that indicates them as such and the argument is moot.  You'll only get a performance benefit if the average usage case is when a subset of the objects are alive.  Such an application might be for loading graphics resources.  Maybe a game can determine your distance to each graphical object.  If it's too far away your program decides not to render it and it marks it as dead.  That way your cache will never be loaded with a graphics object it won't use.

 

Even seemingly small objects can impact cache performance.  It's better to think in terms of size of object versus remaining space in the cache.  So the example you list could even affect performance depending on the situation.

 

The cost of fully associative vs. set associative caching is usually a burden of circuit paths in hardware.  Fully associative are theoretically ideal.  However, the complexity of the circuit logic could increase your access time and limit the maximum cache frequency (it will definitely increase your power, an important aspect in mobile designs).  Set associative is usually the way to go on the big caches.

 

What I was trying to bring up (assuming one cache line, which Hodgman has already pointed out my mistake on) is that whichever way past 50/50 the objects swing (i.e. more alive, type 1 is better, more dead, type 2 is better) in general will impact the superior dataset.

Share this post


Link to post
Share on other sites


The canonical example of cache utilization problems in OOP seems to be the "for all entities in list, if entity is alive, do something with it" problem.

 

Code from the third link is here.

 

Ex. 1:













struct object_def {
    char status;
    double object_var1, object_var2, object_var3, object_var4, object_var5 ;
};
 
struct object_def object_vector[Size];
 
void functionA()
{
    int i;
    for (i=0; i < Size; i++) {
        if (object_vector[i].status)
           do_something_with_object(object_vector[i]);
    }
}

Ex. 2:













struct object_def {
     double object_var1, object_var2, object_var3, object_var4, object_var5 ;
};
 
struct object_def object_vector[Size];
char object_vector_status[Size];
 
void functionA()
{
    int i;
    for ( i=0; i<Size; i++) {
        if (object_vector_status[i])
           do_something_with_object(status[i],object_vector[i]);
    }
}

Supposedly, the second example exhibits better cache locality than the first.  I don't really understand why.

 

If the object is active, and you need to work with it, the rest of the data for that object (or at least some of it) is already in the cache.  If the object is not alive, you flush the cache and carry on.

 

With the second example, it's inverted.  Since your cache line now consists of all your "isAlive" values, every time you find a live object, you have to flush the cache, whereas if the object is not alive, you can carry on.

 

I don't see any appreciable reason to use one versus the other, unless you have pre-existing knowledge of the "aliveness distribution" of your objects.  Do these examples just assume that knowledge, or am I missing something about caches here?

 

How do you measure such fit to better cahe locality??

Maybe also alignment comes into play, I am not sure if the compiler would put three padding bytes after the leading char in example one, or not.

 

As to efficiency I do not know, in the ex 1 you forces all table records to sequential read (gather) in the ex 2 you force it to selective read (gather, I mean cache-read) of some records only

 

afaik sequential table read is faster than selective (random like) but on the other side with selective you just gather less trash into cache

 

(how it work much depend of the info you do not give here, how big the table is, how measurring the effect (if this is one pass read or in the loop - this is crucial), howmany records are alive here, how cahe is big, and maybe also if you touch any other tables in do_something calls)

 

 

( I would maybe guess/suppose that if few records are off  the ex 1 would be noticably faster (because of sequential read of table), [though i am not quite sure is such forward prefetching work such good when reading tables with small (couple of bytes) jumps ahead - or need it be just a solid read ahead)

 

if many are off you got random like acces to records in array (which could be much slower, on old pentium 4 i was measuring it times ago it was about 30x slowa, newer systems are probably better here I am not sure)  but you got less amount of cache trashing, as I said (but this should only be important if you  are cache size constrained in some loop - it is if you working set in the loop is smaller than cache size - If I am thinking correctly)

 

The other thing I do not know (and I would ask it if somebody do know it) is that:

if you have a sequential read of some big table (one pass) it works quite fast - this forward prefetching is done quite good, you got it almost cache speed not ram speed even for very large  arrays, but will it work when you mix such sequential read with some other reads on the other table ? I mean 

 

for(int i=0; i<100*1000*1000; i++)

    sum+=table;                          //this is cache-fast

 

 

for(int i=0; i<100*1000*1000; i++)

{

    sum+=table;                          

 

    sum2+= other_table;   //will this reads kill the prefetcher of first table?

}

 

I am asking here because if some cpu prefetcher works 'loacally'/'temporal' (it is it checks the last amounta of ram read and takes the other after that )  it could be killed because here you have two skipping accesses, and I thing a good prefether here would be have to remember the all sequential

 'fronts'  of reading (which can be say not only two but ten tables of sequential read interlaced) and such prefetcher is probably harder to implement so it could fail (to prefetch some of them) ) ?

(sorry for weak english)

 

I was playing some test times ago on pentium 4 old machine and it was like

 

test()
{
 for ( int k=0; k < 10*1024*1024; k++ )
 mem[lut[k]]++;
}

 

this is code to increment 10M chars , lut is look up table,

when it is initialised sequential the time of execution was 50 ms

when it was initialised random the time was 3000 ms it is 60x worse

 

If somebodu would like to test it on other machines I would like to see the result too

Edited by fir

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!