The canonical example of cache utilization problems in OOP seems to be the "for all entities in list, if entity is alive, do something with it" problem.
The code from the third link is reproduced below.
Ex. 1:

struct object_def {
    char status;
    double object_var1, object_var2, object_var3, object_var4, object_var5;
};

struct object_def object_vector[Size];

void functionA()
{
    int i;
    for (i = 0; i < Size; i++) {
        if (object_vector[i].status)
            do_something_with_object(object_vector[i]);
    }
}
Ex. 2:

struct object_def {
    double object_var1, object_var2, object_var3, object_var4, object_var5;
};

struct object_def object_vector[Size];
char object_vector_status[Size];

void functionA()
{
    int i;
    for (i = 0; i < Size; i++) {
        if (object_vector_status[i])
            do_something_with_object(object_vector_status[i], object_vector[i]);
    }
}
Supposedly, the second example exhibits better cache locality than the first. I don't really understand why.
If the object is active and you need to work with it, the rest of that object's data (or at least some of it) is already in the cache. If the object is not alive, that cache line was loaded for nothing and you carry on.

With the second example it's inverted. Since a cache line now holds a run of "isAlive" values, every time you find a live object you have to load its data from a different cache line, whereas if the object is not alive, you can carry on.
I don't see any appreciable reason to use one versus the other, unless you have pre-existing knowledge of the "aliveness distribution" of your objects. Do these examples just assume that knowledge, or am I missing something about caches here?
How do you measure which one fits better cache locality?
Maybe alignment also comes into play; I am not sure whether the compiler would insert padding bytes after the leading char in example one (three on a 32-bit ABI where double is 4-byte aligned, seven on a typical 64-bit one).
As to efficiency, I do not know for sure: in ex. 1 you force a sequential read (cache fill) of all records of the table, while in ex. 2 you force a selective read of only some records.

AFAIK a sequential table read is faster than a selective (random-like) one, but on the other hand, with the selective read you gather less trash into the cache.

(How it works out depends a lot on information you do not give here: how big the table is, how you measure the effect (whether it is a one-pass read or inside a loop - this is crucial), how many records are alive, how big the cache is, and also whether you touch any other tables in the do_something calls.)

I would guess that if few records are off, ex. 1 would be noticeably faster because of the sequential read of the table (though I am not quite sure whether forward prefetching works as well when reading a table with small jumps of a couple of bytes ahead as it does with a solid read-ahead).

If many records are off, you get random-like access to the records of the array, which can be much slower (on an old Pentium 4 I measured it some time ago at about 30x slower; newer systems are probably better here, I am not sure), but you get less cache trashing, as I said. That should only be important if you are cache-size constrained in some loop, i.e. if your working set in the loop exceeds the cache size, if I am thinking correctly.
The other thing I do not know (and I would ask about it if somebody knows) is this: if you have a sequential one-pass read of some big table, it works quite fast - the forward prefetching is done quite well, and you get almost cache speed rather than RAM speed even for very large arrays. But will it still work when you mix such a sequential read with reads of another table? I mean:

for (int i = 0; i < 100*1000*1000; i++)
    sum += table[i];          // this is cache-fast

for (int i = 0; i < 100*1000*1000; i++)
{
    sum  += table[i];
    sum2 += other_table[i];   // will these reads kill the prefetcher on the first table?
}

I am asking because if some CPU prefetcher works 'locally'/'temporally' (that is, it looks at the last few RAM reads and fetches what follows them), it could be defeated here, because you have two interleaved access streams. A good prefetcher would have to remember all the sequential 'fronts' of reading (and there could be not just two but, say, ten sequentially read tables interleaved), and such a prefetcher is probably harder to implement, so it could fail to prefetch some of them.
(sorry for weak english)
I was playing with a test some time ago on an old Pentium 4 machine, with code like:

void test()
{
    for (int k = 0; k < 10*1024*1024; k++)
        mem[lut[k]]++;
}

This code increments 10M chars; lut is a look-up table. When lut was initialised sequentially, the execution time was 50 ms; when it was initialised randomly, the time was 3000 ms, i.e. 60x worse. If somebody would like to test it on other machines, I would like to see the results too.