Understanding Cache Utilization

The canonical example of cache utilization problems in OOP seems to be the "for all entities in list, if entity is alive, do something with it" problem.

 

Code from the third link is here.

 

Ex. 1:

struct object_def {
    char status;
    double object_var1, object_var2, object_var3, object_var4, object_var5 ;
};
 
struct object_def object_vector[Size];
 
void functionA()
{
    int i;
    for (i=0; i < Size; i++) {
        if (object_vector[i].status)
           do_something_with_object(object_vector[i]);
    }
}

Ex. 2:

struct object_def {
     double object_var1, object_var2, object_var3, object_var4, object_var5 ;
};
 
struct object_def object_vector[Size];
char object_vector_status[Size];
 
void functionA()
{
    int i;
    for ( i=0; i<Size; i++) {
        if (object_vector_status[i])
           do_something_with_object(object_vector[i]);
    }
}

Supposedly, the second example exhibits better cache locality than the first.  I don't really understand why.

 

If the object is active, and you need to work with it, the rest of the data for that object (or at least some of it) is already in the cache.  If the object is not alive, you flush the cache and carry on.

 

With the second example, it's inverted.  Since your cache line now consists of all your "isAlive" values, every time you find a live object, you have to flush the cache, whereas if the object is not alive, you can carry on.

 

I don't see any appreciable reason to use one versus the other, unless you have pre-existing knowledge of the "aliveness distribution" of your objects.  Do these examples just assume that knowledge, or am I missing something about caches here?


It's something like this:

 

In the first case dead objects are still loaded into the cache.  For large objects this can be time consuming.

 

In the second case dead objects are not loaded into the cache.

 

In both cases, if the object is alive you have to load the cache with the object if it's not already there.  Otherwise it's already in the cache.

 

The real benefit, therefore, comes from checking dead objects: in case 1 (the worst case for that layout) you load dead objects into the cache just to skip them, while in case 2 you never load them at all.
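To put rough numbers on that (my own back-of-the-envelope sketch, assuming 64-byte cache lines and the padding a typical x86-64 compiler would apply to the structs above; the object count is arbitrary):

#include <stdio.h>

/* Ex. 1 layout: status byte stored inside each object.  With 8-byte
   alignment for the doubles, a typical compiler pads this to 48 bytes. */
struct object_aos { char status; double v1, v2, v3, v4, v5; };

/* Ex. 2 layout: status bytes pulled out into their own packed array,
   leaving a 40-byte object. */
struct object_soa { double v1, v2, v3, v4, v5; };

int main(void)
{
    const long n = 100000;   /* number of objects (made up for the example) */
    const long line = 64;    /* assumed cache-line size                     */

    /* Ex. 1: checking each status touches the line holding that object,
       so even a scan over mostly-dead objects streams the whole array. */
    long lines_ex1 = (n * (long)sizeof(struct object_aos) + line - 1) / line;

    /* Ex. 2: the status scan touches only the packed status array;
       object lines are loaded only for objects that are actually alive. */
    long lines_ex2 = (n * 1L + line - 1) / line;

    printf("Ex. 1 status scan touches ~%ld cache lines\n", lines_ex1);  /* ~75000 */
    printf("Ex. 2 status scan touches ~%ld cache lines (plus lines for live objects only)\n",
           lines_ex2);                                                  /* ~1563  */
    return 0;
}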


In the first example you always pull the object into the cache, which can result in a high number of cache misses (wasted on objects that turn out to be dead). In the second example the number of cache misses will most likely be lower. Also note that the object itself is smaller in the second example, which allows more objects to fit in the cache, resulting in fewer misses.


Let's say I have a 64B cache line, and my "deadness" variable fits in a byte.  Let's also assume each object is 63B without the deadness variable (or exactly one cache line with it).  I have 64 objects.

 

which means, in case 2, all 64 deadness variables fit in the cache line.

 

I check the first variable, it's "1", so the cache gets flushed, and the object data gets loaded in.  If every object is alive, that's exactly 64 cache misses for my dataset.

 

In the prior case, let's say I can fit one object (including the deadness variable) in a single 64B cache line.

 

In case 1, I load the object.  If it's alive, I operate on it, load the next object, and so on.

 

In case 2, I load the alive dataset.  If the first one is alive, I flush, put the 63B object in, operate, flush, load the alive dataset again, repeat.

 

If every object is alive, that's 64 unnecessary cache misses with the second case.

 

If every object is dead, it's of course 64 unnecessary cache misses with the first case.

 

I'm pretty sure that in a 50/50 situation the two approaches break even in terms of cache misses.  What's more, as the objects themselves shrink, the first approach ought to pull ahead.  Suppose 16 data items (including the flag) fit in a cache line.  Now, even in the worst case, you only have to flush every 16 items, rather than for every single item in the other approach.

 

Am I idealizing the data here too much?



 

Edit: This is my own edit.  The information below is incorrect.  I'm keeping it (it was originally struck through) to demonstrate the wrong way of thinking, in case that's of any use.

 

Yes. 

 

You're correct, in that if your objects are small already there's really no advantage to this approach.

 

However, suppose your objects are large -- pathologically large to make the point.  Let's say your objects are 0.1 MB apiece and your cache is 2 MB.  Suppose you have 1000 of these objects.  Your cache can hold at most 20 of these objects.  That means you'll be thrashing the cache when you loop through your 1000 objects, because they'll all have to be loaded to check if they should be used or not.

 

In the second approach, at the cost of 1000 bytes in your cache, you'll know exactly which ones to load.  The advantage is more noticeable, obviously, if you only need to load a subset of those 1000 objects.

 


 


 

 

The original example had 5 doubles and a char (41-44 bytes in VS's compiler), so I guess that's just a bad example.

 

I'm also not taking associativity into account, since I'm not too familiar with its practical effects.  From what little I know about associative caches, though, ignoring it is probably not a good idea.


A few things...

 

If all of your objects are always alive then there's no point in having a variable that marks them as such, and the argument is moot.  You'll only get a performance benefit when the typical case is that only a subset of the objects is alive.  One such application might be loading graphics resources: a game can determine your distance to each graphical object, and if it's too far away the program decides not to render it and marks it as dead.  That way your cache will never be loaded with a graphics object it won't use.
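As a sketch of that idea (my own illustration with hypothetical names, not code from the examples above): the cheap distance test streams only small, densely packed arrays (positions and status bytes), and the heavy per-object data is touched only for objects that pass it.

#include <stddef.h>

#define NUM_OBJECTS 1024

/* Heavy per-object data (meshes, materials, ...); hypothetical layout. */
struct render_object { char payload[256]; };

/* Ex. 2 style: the data needed for the cheap test lives in small,
   densely packed arrays, separate from the heavy objects. */
static float position_x[NUM_OBJECTS], position_y[NUM_OBJECTS], position_z[NUM_OBJECTS];
static char  object_alive[NUM_OBJECTS];
static struct render_object objects[NUM_OBJECTS];

/* Cheap pass: streams only the position and status arrays. */
void cull(float cx, float cy, float cz, float max_dist)
{
    float r2 = max_dist * max_dist;
    for (size_t i = 0; i < NUM_OBJECTS; i++) {
        float dx = position_x[i] - cx;
        float dy = position_y[i] - cy;
        float dz = position_z[i] - cz;
        object_alive[i] = (dx*dx + dy*dy + dz*dz) <= r2;
    }
}

/* Expensive pass: only objects marked alive are ever pulled into the cache. */
void render_visible(void (*draw)(const struct render_object *))
{
    for (size_t i = 0; i < NUM_OBJECTS; i++)
        if (object_alive[i])
            draw(&objects[i]);
}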

 

Even seemingly small objects can impact cache performance.  It's better to think in terms of size of object versus remaining space in the cache.  So the example you list could even affect performance depending on the situation.

 

The cost of fully associative vs. set-associative caching is usually a burden of circuit paths in hardware.  A fully associative cache is theoretically ideal.  However, the complexity of the circuit logic can increase your access time and limit the maximum cache frequency (it will definitely increase your power consumption, an important aspect in mobile designs).  Set-associative is usually the way to go for the big caches.


 

*stuff*

 

I've heard about the sorting stuff before, and it makes sense.  The caching itself didn't make sense though, because I didn't take into account the relationship between a cache line and the cache itself (which I should have, given that I've taken an architecture class).  That's supremely helpful.  Thanks!

 

 


 

What I was trying to bring up (assuming a single cache line, the mistake Hodgman has already pointed out) is that whichever way the split swings past 50/50 determines which layout wins in general: the more objects are alive, the better type 1 looks, and the more are dead, the better type 2 looks.




 

How do you measure which one actually has better cache locality?

Alignment may also come into play; I'm not sure whether the compiler would insert padding bytes after the leading char in example one or not.

 

As to efficiency, I don't know.  In example 1 you force a sequential read of all the table's records; in example 2 you force a selective read (into the cache) of only some of the records.

 

AFAIK a sequential table read is faster than a selective (random-like) one, but on the other hand, with the selective read you pull less junk into the cache.

 

(How it works out depends a lot on information you don't give here: how big the table is, how you measure the effect (whether it's a one-pass read or inside a loop is crucial), how many records are alive, how big the cache is, and maybe also whether you touch any other tables in the do_something calls.)

 

 

(I would guess that if only a few records are off, example 1 would be noticeably faster because of the sequential read of the table, though I am not quite sure whether forward prefetching works as well when reading a table with small jumps of a couple of bytes ahead, or whether it needs to be a solid read-ahead.)

 

If many are off, you get random-like access to the records in the array (which could be much slower; on an old Pentium 4, when I measured it some time ago, it was about 30x slower - newer systems are probably better here, I'm not sure), but you get less cache thrashing, as I said (though that should only matter if you are cache-size constrained in some loop, i.e. if your working set in the loop is smaller than the cache size - if I'm thinking correctly).

 

The other thing I don't know (and would ask, if somebody does know) is this: a sequential one-pass read of a big table works quite fast - the forward prefetching is done quite well, and you get almost cache speed rather than RAM speed even for very large arrays - but will it still work when you mix such a sequential read with reads from another table?  I mean:

 

for (int i = 0; i < 100*1000*1000; i++)
    sum += table[i];                  // this is cache-fast

for (int i = 0; i < 100*1000*1000; i++)
{
    sum  += table[i];
    sum2 += other_table[i];           // will this read kill the prefetcher of the first table?
}

 

I am asking because if the CPU prefetcher works 'locally'/'temporally' (i.e. it looks at the last chunk of RAM that was read and fetches the one after it), it could be defeated here, because you have two interleaved access streams.  A good prefetcher would have to remember all the sequential 'fronts' of reading (which could be, say, not just two but ten interleaved sequentially-read tables), and such a prefetcher is probably harder to implement, so it could fail to prefetch some of them.


 

I was playing with a test some time ago on an old Pentium 4 machine, and it went like this:

 

void test(void)
{
    for (int k = 0; k < 10*1024*1024; k++)
        mem[lut[k]]++;
}

 

This is code to increment 10M chars; lut is a lookup table.  When the table was initialised sequentially, the execution time was 50 ms.  When it was initialised randomly, the time was 3000 ms, i.e. 60x worse.

 

If somebody would like to test it on other machines, I would like to see the results too.
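For anyone who wants to reproduce that kind of measurement, here is a self-contained sketch of the test as I understand it (my own code, not the original; the array size, timer and shuffle are arbitrary choices, and the absolute numbers will differ per machine):

#include <stdio.h>
#include <time.h>

#define N (10 * 1024 * 1024)

static unsigned char mem[N];
static int lut[N];

/* Time one pass of the increment loop, in milliseconds of CPU time. */
static double run(void)
{
    clock_t t0 = clock();
    for (int k = 0; k < N; k++)
        mem[lut[k]]++;
    return (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
}

int main(void)
{
    /* Sequential lookup table: accesses walk memory in order. */
    for (int k = 0; k < N; k++)
        lut[k] = k;
    printf("sequential: %.1f ms\n", run());

    /* Shuffle the table (Fisher-Yates with a small LCG) so the
       accesses jump around memory in a random-like pattern. */
    unsigned int seed = 1234u;
    for (int k = N - 1; k > 0; k--) {
        seed = seed * 1664525u + 1013904223u;
        int j = (int)(seed % (unsigned int)(k + 1));
        int tmp = lut[k]; lut[k] = lut[j]; lut[j] = tmp;
    }
    printf("random:     %.1f ms\n", run());
    return 0;
}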


 

    sum2 += other_table[i];   // will this read kill the prefetcher of the first table?

 

Your intuition is correct here -- there's no way a prefetcher is going to be smart enough to realize that you're oscillating between distinct data sets.




I think you'd be surprised by this; a loop as posted is simply data reads from two locations, and the CPU will be able to work that out without much of a problem.


 


 

 

Are there any decent articles on how these prefetching algorithms work?  Such optimizations may not be the most crucial thing for me (since they take development time themselves), but I do like to optimize stuff too.


 

That means you'll be thrashing the cache when you loop through your 1000 objects, because they'll all have to be loaded to check if they should be used or not.

 

That's not how cache works. The cache will cache the data you touch, not all the data contained within an object (the CPU has no idea what an object is, it just sees addresses). The advantage of the second approach here is more down to the fact that memory loads happen in 16-byte blocks (on an SSE-enabled CPU).

 

Your intuition is correct here -- there's no way a prefetcher is going to be smart enough to realize that you're oscillating between distinct data sets.

The prefetcher is perfectly happy reading N streams of data, so long as the access pattern remains constant (preferably linear). The prefetcher only starts to have issues when you are reading random, unrelated memory addresses (assuming you're talking about Intel CPUs after the Pentium 4 cock-up; YMMV on other architectures).
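A quick way to check that on a given machine is to time two separate sequential passes against one fused loop that reads both arrays, i.e. two interleaved linear streams. This is my own sketch, not something from the thread, and the numbers will depend entirely on the CPU and memory system:

#include <stdio.h>
#include <time.h>

#define N (100 * 1000 * 1000)

static unsigned char table[N], other_table[N];

int main(void)
{
    long long sum = 0, sum2 = 0;

    /* Two separate sequential passes: the obviously prefetch-friendly case. */
    clock_t t0 = clock();
    for (int i = 0; i < N; i++) sum  += table[i];
    for (int i = 0; i < N; i++) sum2 += other_table[i];
    double separate_ms = (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;

    /* One fused loop reading both arrays: two interleaved linear streams. */
    t0 = clock();
    for (int i = 0; i < N; i++) {
        sum  += table[i];
        sum2 += other_table[i];
    }
    double fused_ms = (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;

    printf("separate passes: %.1f ms, fused loop: %.1f ms (sums %lld %lld)\n",
           separate_ms, fused_ms, sum, sum2);
    return 0;
}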

 

How do you measure which one actually has better cache locality?
x64 reads memory in 16-byte chunks. In the first example, the single status byte will be read (along with 15 bytes of crap that will be discarded). In the second example, 16 status bytes will be read in one go without having to discard any data (really though, the first example is a terrible memory layout that should have been fixed anyway). However, unless Size is a very big number (e.g. 500000) and the status is false more often than not, you're unlikely to see much of a performance difference between the two approaches. Just make the code work in the stupidest fashion possible, and worry about this stuff when you can prove you have a problem.


 

 

Are there any decent articles on how these prefetching algorithms work?

 

 

I've seen many articles on prefetching, and every single one of them is out of date. On the Pentium 4 you needed to worry about this, but on a modern CPU, just try to keep memory access linear and you will never need to worry about it. If you're forced to access memory in a random order, then prefetching *may* help. However, this is something for which you do not want to read an article. If you do, you'll de-optimise your code (because inevitably the article will be a couple of years old, and will not apply to the CPU you are currently using). If the profiler says you have a performance problem, then by all means try to use prefetching if it makes sense (usually, though, streaming has historically been a more useful cache optimisation, although apparently the behaviour of the memory controller has changed on the latest Intel CPUs, so even that advice might be out of date now). If you come across a prefetching article and it attempts to demonstrate how prefetching the next few elements of a linear array will improve performance, you can safely assume the author hasn't got a clue what they're talking about...


 

 

That's not how cache works. The cache will cache the data you touch, not all the data contained within an object.

 

 

I stand corrected!


 

 

 

...usually, though, streaming has historically been a more useful cache optimisation...

 

 

What do you mean by streaming, and what do you mean by prefetching?

 

By prefetching (you could also call it forward prefetching) I mean the automatic pulling of RAM into the cache when you read a large array sequentially in the forward direction (though backward reading works just as well, if I remember correctly).

 

Those are the edge cases: one where cache prefetching works very well (processing an array sequentially) and the other (random jumps into uncached RAM) where it works much slower (10x? 20x? 50x?).

 

In programming I think there are also cases in between those two, for example the one I mentioned about processing two interleaved arrays sequentially, or processing an array sequentially but with small jumps/gaps, as when you use one field in an array of records and don't touch the other ones:

 

for (int i = 0; i < 100*mega; i++)
{
    sum += bytearray1[i];
    sum += bytearray2[i];
    sum += bytearray3[i];
    sum += bytearray4[i];   // four streams, each sequential but interleaved
}

for (int i = 0; i < 100*mega; i += some)   // some = 10 or 100 for example
{
    sum += bytearray[i];    // sequential but not solid (with gaps)
}   // <-- how about this one: no prefetching, and as slow as the random case?

 

I do not know how well this can be handled. 

 

There are yet more complex cases, I think, and I don't know how the hardware handles them - they probably look more like random access than sequential access, so probably it can't do anything, and sequential prefetching is the only trick that can be used here (but I don't know).
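To answer the "sequential but with gaps" question empirically, one could sweep the stride and watch how the time per touched byte changes. This is my own sketch (array size and strides chosen arbitrarily), not a measurement from the thread:

#include <stdio.h>
#include <time.h>

#define N (64 * 1024 * 1024)

static unsigned char bytearray[N];

int main(void)
{
    int strides[] = { 1, 4, 16, 64, 256 };
    for (size_t s = 0; s < sizeof strides / sizeof strides[0]; s++) {
        int stride = strides[s];
        long long sum = 0;
        clock_t t0 = clock();
        for (int i = 0; i < N; i += stride)   /* sequential, but with gaps */
            sum += bytearray[i];
        double ms = (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
        long touched = (N + stride - 1) / stride;
        printf("stride %3d: %.1f ms total, %.2f ns per touched byte (sum %lld)\n",
               stride, ms, ms * 1e6 / touched, sum);
    }
    return 0;
}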



 
x64 reads memory in 16 byte chunks. 

That's not right; I'm not sure where you get 16 bytes from. 8 bytes are read at a time from a memory device, but it always fills in an entire cache line even if it doesn't need all of it, because it can't cache in units smaller than a line.

That may sound inefficient, but the cost of fetching a few additional bytes in order is much less than the cost of requesting and locating the first bit that you actually need. It also reduces the complexity of the memory controller, because the size and granularity of the addresses it deals with are fixed.
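The line size is easy to check rather than assume. On Linux with glibc, for example, it can be queried at runtime (these _SC_ names are a glibc extension; the call returns 0 or -1 where the value isn't exposed):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Reported line sizes for the data caches; 64 bytes on most current
       x86-64 parts, but worth confirming on the machine you care about. */
    printf("L1 line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 line: %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_LINESIZE));
    printf("L3 line: %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_LINESIZE));
    return 0;
}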


Note that most of this is just semi-educated guesswork.

 

You don't actually know until you have measured it.  To actually know, start by using tools like VTune and CodeAnalyst that will measure these things.



 

Yeah, that's true, tests are reliable (I always get quite good information from tests), but testing also costs a decent amount of energy and time (I am a tired programmer), and it would be nice to chat and exchange information the 'easier' way.  (Also, testing as an approach only shows you the surface/edges of things and can potentially mislead you sometimes, unlike reasoning based on the actual underlying mechanisms.)




The problem with this attitude is that you assume - incorrectly - that your intuition about how hardware works is worth anything.

I do not say this to single you or anyone else out, or to be rude; the simple fact is that modern processing hardware is incredibly sophisticated. Even the best silicon engineers in the world very rarely have a complete idea of how every part of a CPU pipeline works.

If the experts who design these chips don't have perfect intuition about how the performance will play out, the chances of a software guy getting it right more than 50% of the time are near zero. If you want to talk about conserving energy and time, not wasting your effort on pointless guesswork is a great way to start.


I'm willing to bet he is "right" as the chances are the CPU will fetch 16 bytes, maybe in two 8 byte reads, at the very least when accessing an address; to do otherwise would potentially introduce a very early cache stall.

Where do the 16 bytes come from? The CPU will fetch 64 bytes, in eight 8-byte reads.



I agree that you should always profile when optimizing performance, but you need guesswork and intuition to decide which optimizations to try out. So even if the odds of an optimization working are low, it's worth discussing and trying to impart a public intuition about what works and what doesn't.

The real solution for those situations would be to sort the data ahead of time, so that all the active objects are stored contiguously, and all the inactive ones elsewhere. If you do that, then instead of having active flags, you could instead have a start/end iterator describing the active range too, which is much neater ;)

i.e. design the code with linear data access patterns in mind

 

Could you elaborate on this a little more?  It seems like sorting data isn't always going to be a win.  The only time I can imagine where this would be beneficial is if a certain system required the use of many small objects with common data transformations that could be basically double buffered.  As updates progress, you would simply copy the objects that survived frame i to the buffer for frame i + 1.  If instead the objects were so large that a copy would be prohibitively expensive and you instead sort references, it doesn't seem like you gain significant advantage since consecutive references aren't actually guaranteed to be near each other.  Does this case rarely occur in practice?

 

In the game I'm working on, I decided to set up object pools to help contain fragmentation of the heap and to also improve cache locality for operations on entire collections of objects (since they all come from the same memory block).  However, my entity lists aren't sorted in any way that guarantees locality and some of the objects are pretty large so I don't end up compacting those heap blocks since I fear the cost of data movement is higher than just eating the cache miss.  Am I mistaken in this assumption?
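To make the quoted suggestion concrete, here is a minimal sketch (my own, with hypothetical names) of a pool that keeps live objects packed at the front, so the hot loop walks one contiguous range and needs no per-object flag:

#include <stddef.h>

#define POOL_CAPACITY 1024

/* Hypothetical entity type; the payload stands in for "pretty large" objects. */
struct entity { float x, y, z; char payload[64]; };

struct entity_pool {
    struct entity items[POOL_CAPACITY];
    size_t live_count;                 /* items[0 .. live_count) are alive */
};

/* Allocate from the packed live range at the front. */
struct entity *pool_spawn(struct entity_pool *p)
{
    if (p->live_count == POOL_CAPACITY) return NULL;
    return &p->items[p->live_count++];
}

/* Kill by swapping the last live element into the hole ("swap and pop").
   Order is not preserved, but the live range stays contiguous. */
void pool_kill(struct entity_pool *p, size_t index)
{
    p->items[index] = p->items[--p->live_count];
}

/* The update loop walks a dense, linear range: no alive checks, no gaps. */
void pool_update(struct entity_pool *p, void (*update)(struct entity *))
{
    for (size_t i = 0; i < p->live_count; i++)
        update(&p->items[i]);
}

The trade-off is exactly what the question above raises: pool_kill moves another element, so stable external references need an extra level of indirection (an index or handle table), and for very large objects that copy may itself be the cost you were trying to avoid.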

