Swapping addresses between pointers and cache locality

Assume I have a vector that consists of pointers. When the contents of this vector are read (one element at a time, in order) I should be reducing cache misses, since the pointers are adjacent to each other in memory. Now, if we assume that what each pointer is POINTING at also lies in a contiguous block of memory, then from what I've understood, reads of the pointed-to data should also benefit (corrections here would be welcome). To clarify with some code:


Foo* foo = new Foo[size];
std::vector<Foo*> fooVector;
fooVector.reserve(size);

for(int i = 0; i < size; i++)
{
    fooVector.push_back(&foo[i]); // each pointer refers to one element of the contiguous array
}


Now let's assume that during run-time an event occurs, and each time it does, one pointer in the vector is processed: some operations are done on the value it points at, and then a swap is performed. The swap exchanges the address held by the currently processed pointer with the address held by the last pointer in the vector. Note that 'last' here doesn't necessarily mean the last element of the vector; some variable could keep track of what is considered the "last element". That detail isn't important, though; the point is that the addresses the pointers hold are swapped between them. Again, some code to clarify what I mean:


void processPointer(int pointerId)
{
    // do some operations on the Foo object pointed at by fooVector[pointerId]

    // swap the addresses held by the two pointers
    std::swap(fooVector[pointerId], fooVector[lastElement]);
}



Now, as far as I am concerned, the next time the vector is read the pointers are still adjacent in memory, so nothing changes there. What they point at becomes unordered over time, though: the first element of the vector could be pointing at the third element of the heap-allocated array, the third element of the vector at the tenth element of the array, and so on. How does this affect cache locality when the pointed-to values are read in this unordered state? I realize that the values still belong to the same contiguous block, but a cache line doesn't hold unlimited memory.

To elaborate on that last sentence, I see it this way: say we end up in a situation where fooVector[0] points at foo[3] and fooVector[1] points at foo[10]. When I go through my vector step by step and perform calculations on the pointed-to values, processing fooVector[0] could perhaps pull foo[3], foo[4] and foo[5] into the cache (perhaps something before foo[3] as well; I honestly don't know whether it looks "behind" or not), and that might be as much as fits in one cache line. When I then process fooVector[1], foo[10] will not be in the cache, and I incur a cache miss. I realize this depends entirely on the actual size of my objects (a Foo object in this case), but this is how I thought of it.

I could be wrong about many things here; I'm fairly new to thinking about the cache when coding, so I may have made some wrong assumptions. If so, please let me know.
So basically you have two contiguous blocks of memory, one for the pointers and one for the objects and you are accessing the objects in an unordered fashion? That could very well lead to an increased number of cache misses, but who knows? Is there an underlying problem you're trying to solve?
Assume I have a vector that consists of pointers. When the contents of this vector are read (one element at a time, in order) I should be reducing cache misses, since the pointers are adjacent to each other in memory. Now, if we assume that what each pointer is POINTING at also lies in a contiguous block of memory, then from what I've understood, reads of the pointed-to data should also benefit (corrections here would be welcome).
Yes and yes. Since you know how cache lines work, I assume that both your vector's data (the pointers) and the pointed-to data are larger than a single cache line (otherwise you are fine no matter what you do, since both arrays are contiguous).

But the real question is, unless I missed something: why do you need to swap the pointers in the first place, and how often do you process pointers and/or iterate the array?
Also, how large are these arrays? The cache consists of several cache lines, so if everything fits into the cache as a whole, it might not become an issue at all.

Of course, if you have a real case, and this is not just hypothetical, you really should profile with a profiler that can highlight cache usage/misses.

So basically you have two contiguous blocks of memory, one for the pointers and one for the objects and you are accessing the objects in an unordered fashion? That could very well lead to an increased number of cache misses, but who knows? Is there an underlying problem you're trying to solve?


Yes, it's pretty much as you said: two contiguous blocks of memory, but one of them ends up being accessed in an unordered fashion as time goes by. As for an underlying problem, I'm trying to avoid cache misses as much as possible. It is just a goal I want to fulfill as well as I can.

The problem here is that my vector is supposed to hold only pointers to objects on which I will perform operations in my program. When I want to stop considering an object, I get "rid" of the corresponding pointer. Instead of removing the pointer from the vector (and possibly causing a re-allocation of all the remaining elements in the vector), I merely swap it with the last element and decrease a counter that keeps track of how many objects are left. This gives me two advantages: I do not constantly have to remove and add pointers to my vector, and I only access objects which must be changed; any object whose pointer I swapped out earlier will not even be processed until a specific event occurs.
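(A minimal sketch of the swap-with-last pattern described above; activeCount stands in for the counter mentioned, and the names are illustrative rather than from the actual code.)

#include <algorithm>
#include <vector>

struct Foo; // the object type from the earlier snippets

// "Remove" the pointer at pointerId by swapping it past the live range
// and shrinking the live count; no elements are moved or reallocated.
void deactivate(std::vector<Foo*>& fooVector, int& activeCount, int pointerId)
{
    --activeCount; // index of the last live pointer
    std::swap(fooVector[pointerId], fooVector[activeCount]);
}

// Per-frame processing then only touches the live range:
//     for(int i = 0; i < activeCount; i++) { /* operate on *fooVector[i] */ }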

The alternative would be to add/remove pointers from my vector, or even add/remove objects themselves (either of which would avoid the whole swap business), but this is not very suitable in my case, as I would be doing many allocations/de-allocations in a short time (not to speak of all the data that has to be moved around each time an object is removed), and that has its own problems. I posted more about this in a thread I created in the game programming section, although that thread wasn't exclusively related to this issue. The way I have it now allows my objects and the pointers to be allocated on the heap once, and once only, throughout the whole life of the program. But yes, my current solution results in the unordered access explained above.

I could go back to my old implementation, which avoided accessing my array of objects in an unordered fashion. The disadvantage is that I then have to check a condition for ALL my objects, every pass through my main loop, to decide whether each object should be processed. For example, with 32 objects and a method called on each of them 60 times per second, that is 32 * 60 = 1920 branch checks per second just for updating my objects; I have to do the same when rendering (and rendering runs many more times per second than my update).

I suppose it's quite hard to know which one would affect overall performance more without doing any profiling. Ultimately I want to avoid cache misses more than branches, which sort of answers my own question about which solution to go with, but I'd like the best of both.

Anyway, sorry to go off-topic about this. I was just wondering whether the unordered access of my objects would increase the number of cache misses I get.

Yes and yes. Since you know how cache lines work, I assume that both your vector's data (the pointers) and the pointed-to data are larger than a single cache line (otherwise you are fine no matter what you do, since both arrays are contiguous).

My vector consists of 32 pointers, and each pointer occupies 4 bytes, so 128 bytes in total. I remember reading that some architectures use cache-line sizes of 128 bytes, but it is a bit unrealistic to expect this on all PCs. I think it is better to assume a smaller size, such as 32 bytes, which would require four cache lines just for the vector.

As for the array which holds my objects (32 of them): each object takes up 20 bytes, 640 bytes in total. I could reduce the object size to 16 bytes by using a smaller datatype, but I've decided not to go that route before doing some profiling. So at the moment a cache line can only hold one object (quick question: since the object occupies 20 of the 32 available bytes, will the rest be used for the neighbouring object, or will it end up unused? The latter of course isn't good, but not always avoidable). I realize we're talking about really small sizes here, and most people might say it won't make any difference performance-wise, but I have my set of goals I want to try to fulfill.
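(As an aside, a quick way to double-check those numbers on a given compiler; the Foo layout below is hypothetical, five 4-byte floats chosen only to reproduce the 20-byte figure quoted above.)

#include <cstdio>

struct Foo { float a, b, c, d, e; }; // hypothetical 20-byte object

int main()
{
    std::printf("sizeof(Foo) = %u bytes\n", (unsigned)sizeof(Foo));
    // With 32-byte lines, 32 is not a multiple of 20, so objects in an
    // array straddle line boundaries rather than leaving fixed gaps.
    std::printf("bytes of a 32B line left after one Foo: %u\n",
                (unsigned)(32 - sizeof(Foo)));
    return 0;
}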


But the real question is, unless I missed something: why do you need to swap the pointers in the first place, and how often do you process pointers and/or iterate the array?

See the reply above. To add to it: I currently iterate through my vector well over 60 times per second (the array itself is rarely iterated, since most updates to the objects in it are done when I process the pointers pointing at them; at most it is iterated once every 2-3 minutes).


Also, how large are these arrays? The cache consists of several cache lines, so if everything fits into the cache as a whole, it might not become an issue at all.

Should be answered in the first paragraph, if I'm not mistaken. But again, even if most of it fits in the L1 cache, I don't want to use that as an excuse to make bad design choices ;)


Of course, if you have a real case, and this is not just hypothetical, you really should profile with a profiler that can highlight cache usage/misses.

Well, this really seems to be the best choice, although I just wanted to make sure I had understood my underlying problem correctly (that I am, at least theoretically, incurring more cache misses with my current solution).

Instead of removing the pointer from the vector (and possibly causing a re-allocation of all the remaining elements in the vector), I merely swap it with the last element and decrease a counter that keeps track of how many objects are left.
On most implementations, decreasing the size of a vector won't cause it to reallocate - it will just reduce its internal size counter to keep track of what's left, and when you re-add an element, it won't need to reallocate either.
However, this behaviour isn't specified by the standard, so a std::vector may reallocate when used like this. You could use an alternate implementation, like rde::vector, so you know exactly what it's going to do, or you could just keep your current system and change to Foo** fooVector = new Foo*[size]; ;)
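(A quick check of that behaviour on your own implementation might look like this; it demonstrates the common case, not a guarantee from the standard.)

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v(32);
    std::printf("capacity before: %u\n", (unsigned)v.capacity());
    v.pop_back();     // shrink the logical size...
    v.push_back(42);  // ...and grow it back
    std::printf("capacity after:  %u\n", (unsigned)v.capacity()); // typically unchanged
    return 0;
}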

I could go back to my old implementation, which avoided accessing my array of objects in an unordered fashion. The disadvantage is that I then have to check a condition for ALL my objects, every pass through my main loop, to decide whether each object should be processed. For example, with 32 objects and a method called on each of them 60 times per second, that is 32 * 60 = 1920 branch checks per second.
Keeping track of 'active' objects via a list of pointers is a good solution, but for the sake of alternatives:
If you've got 32 players, you can use a 32-bit integer as a collection of booleans to tell which ones are active. To then iterate through only the active players, you can use a loop like:

for( u32 mask = activePlayers; mask; mask &= mask-1 ) // while bits are set: do loop body, then clear the least significant set bit
{
    int offset = LeastSignificantBit(mask); // implemented via _BitScanForward on MSVC
    players[offset].Update();
}
Larger collections can use an array of 32-bit masks (or 64-bit masks would be optimal for x64 builds).
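(For completeness, a sketch of what that LeastSignificantBit helper could look like; the _BitScanForward intrinsic is the one mentioned above, while the GCC/Clang branch is my assumption.)

typedef unsigned int u32;

#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index (0-31) of the least significant set bit; mask must be non-zero.
inline int LeastSignificantBit(u32 mask)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, mask); // MSVC intrinsic
    return (int)index;
#else
    return __builtin_ctz(mask);    // GCC/Clang: count trailing zeros
#endif
}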
My vector consists of 32 pointers, and each pointer occupies 4 bytes, so 128 bytes in total. I remember reading that some architectures use cache-line sizes of 128 bytes, but it is a bit unrealistic to expect this on all PCs. I think it is better to assume a smaller size, such as 32 bytes, which would require four cache lines just for the vector.

You can use a small test program to check what cache-line size each of your development PCs has. A 32KiB cache split into 128B lines is probably a good guess for a lot of PCs; these days the numbers are probably larger, not smaller. [edit] Actually I just tested my 4-core Q6600 from 2007: each core has two 32KiB L1 caches (instruction/data) with 64B lines, and each pair of cores shares a 4MiB L2 cache, also with 64B lines. [/edit]
N.B. if your 128B vector isn't aligned to a 128B memory address (i.e. if (pointer & 0x7F) != 0), then it will be split over two cache lines, so iterating through your vector should cause at most 2 cache misses.
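(The test function linked in the original post hasn't survived; as a stand-in, here is a sketch of querying the line size on Windows via GetLogicalProcessorInformation. Treat the details as something to verify against MSDN rather than a drop-in replacement.)

#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len); // first call just reports the needed buffer size
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    GetLogicalProcessorInformation(&info[0], &len);

    for(size_t i = 0; i < info.size(); ++i)
        if(info[i].Relationship == RelationCache && info[i].Cache.Level == 1)
            std::printf("L1 cache line size: %u bytes\n", (unsigned)info[i].Cache.LineSize);
    return 0;
}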
So at the moment a cache line can only hold one object (quick question: since the object occupies 20 of the 32 available bytes, will the rest be used for the neighbouring object, or will it end up unused?).

A 128B cache line will 'download' 128B of RAM. Whatever is in that RAM is downloaded. If your object only takes up half the line, then whatever is in the other half of the line will now be in the cache.
As for the array which holds my objects (32 objects): each object takes up 20 bytes, 640 bytes in total.

If you know you're going to iterate through this whole array, you could be greedy and just prefetch the whole 5/6 lines in advance, e.g.:

for( const char* data = (const char*)players, *end = data + sizeof(Player) * size; data < end; data += cacheLineSize )
    Prefetch( data ); // e.g. a thin wrapper around the _mm_prefetch intrinsic
...but adding prefetch commands is really something that you want to do during and after profiling, and you also want to test it carefully to make sure you've not done more harm than good.

In your case, you could just sort your vector of player pointers before updating, and then you'll be iterating through your player array in linear order (which hopefully convinces the CPU to prefetch for you automatically).
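(That sort is nearly a one-liner; a sketch, reusing the fooVector/activeCount names from earlier in the thread:)

#include <algorithm>
#include <functional>
#include <vector>

struct Foo; // the object type from the earlier snippets

// Sort the live range of pointers by address, so iterating the pointers
// visits the pointed-to objects in linear memory order.
void sortLiveRange(std::vector<Foo*>& fooVector, int activeCount)
{
    std::sort(fooVector.begin(), fooVector.begin() + activeCount,
              std::less<Foo*>()); // std::less gives a total order over pointers
}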

On most implementations, decreasing the size of a vector won't cause it to reallocate - it will just reduce its internal size counter to keep track of what's left, and when you re-add an element, it won't need to reallocate either.
However, this behaviour isn't specified by the standard, so a std::vector may reallocate when used like this. You could use an alternate implementation, like rde::vector, so you know exactly what it's going to do, or you could just keep your current system and change to Foo** fooVector = new Foo*[size];

That's true. I think I somehow mixed up re-allocation with the moving of data inside the vector :D What I really meant to say is that if I remove an element in the middle of my vector, I create a gap, and all elements following the removed one are moved one step down, which means the assignment operator (or the copy constructor, I forget which) is called a number of times depending on how many elements follow. That is one of the reasons I went with the swapping method.


Keeping track of 'active' objects via a list of pointers is a good solution, but for the sake of alternatives:
If you've got 32 players, you can use a 32-bit integer as a collection of booleans to tell which ones are active. To then iterate through only the active players, you can use a loop like:

for( u32 mask = activePlayers; mask; mask &= mask-1 ) // while bits are set: do loop body, then clear the least significant set bit
{
    int offset = LeastSignificantBit(mask); // implemented via _BitScanForward on MSVC
    players[offset].Update();
}

Larger collections can use an array of 32-bit masks (or 64-bit masks, which would be optimal for x64 builds).

I'm probably being really stupid here, but how exactly is this code working? The way I understand it: I have my 32 objects, and when all of them are active, all bits are set. A quick look at _BitScanForward says it searches for the least significant bit that is set, so each loop iteration I find this index and use it to update my player. Do I understand correctly? Also, I understand you said this is just an alternative, but I was wondering how it differs in general from having something like this:


for(int i = 0; i < 32; i++)
{
    if(objects[i].use()) // if the object is to be processed...
    {
        objects[i].update(); // ...do stuff with it
    }
}




You can use a small test program to check what cache-line size each of your development PCs has. A 32KiB cache split into 128B lines is probably a good guess for a lot of PCs; these days the numbers are probably larger, not smaller. [edit] Actually I just tested my 4-core Q6600 from 2007: each core has two 32KiB L1 caches (instruction/data) with 64B lines, and each pair of cores shares a 4MiB L2 cache, also with 64B lines. [/edit]
N.B. if your 128B vector isn't aligned to a 128B memory address (i.e. if (pointer & 0x7F) != 0), then it will be split over two cache lines, so iterating through your vector should cause at most 2 cache misses.

Very helpful code, thank you :) The reason I'm assuming a smaller cache (and smaller cache lines) is that my target hardware is probably worse than your 4-core Q6600. I don't have any specifications at hand, but let's say the hardware I want to test on has only one core. Perhaps it is still wrong to assume a 32-byte cache line, but that is outside my knowledge. But yes, with 128-byte lines the vector would be split over two cache lines, and over even more with 64- or 32-byte lines, but I suppose that's still quite good. It also makes me wonder: if I know beforehand how many objects I will have, wouldn't I be better off creating an array as a member variable of my Player class (call it Player for simplicity) to hold the variables that are updated often? To clarify, say I have this


struct Player
{
    Vector2 position; // updated often

    // some other variables...
};


and that I create 32 instances of this struct, but most of the time I will only be updating the position variable. Wouldn't I be better off doing something like this, to fit more data that will actually be used into a cache line:


struct Player
{
    Vector2* positions; // updated often; one contiguous array holding the positions of all 32 objects
                        // (allocated once up front, e.g. positions = new Vector2[size];)

    // ...and separate arrays for the other variables that are not used as much...
};



A 128B cache line will 'download' 128B of RAM. Whatever is in that RAM is downloaded. If your object only takes up half the line, then whatever is in the other half of the line will now be in the cache.

Thanks! I still have to ask, though: if, for example, three Foo objects fit in one cache line, then when reading foo[1], could foo[0] be put in that cache line as well? Or would I only be reading what comes after foo[1] (foo[2], foo[3], ...)?


If you know you're going to iterate through this whole array, you could be greedy and just prefetch the whole 5/6 lines in advance, e.g.:

for( const char* data = (const char*)players, *end = data + sizeof(Player) * size; data < end; data += cacheLineSize )
    Prefetch( data );

...but adding prefetch commands is really something that you want to do during and after profiling, and you also want to test it carefully to make sure you've not done more harm than good.

In your case, you could just sort your vector of player pointers before updating, and then you'll be iterating through your player array in linear order (which hopefully convinces the CPU to prefetch for you automatically).

I do agree that I should profile before getting into prefetching. I suppose the same could be said about sorting: it would decrease possible cache misses, but then it's a question of how much of a penalty I pay for sorting so often. That might be hard to know without profiling as well.


Really appreciate all the help (and explanations) given here. Going by what we have so far, the next step seems to be profiling, and deciding from the results how to change my code based on what's been discussed. While I'm at it, what's a good profiler to use? I'm on Windows XP with an AMD Athlon X2 Dual Core 4600+.

I'm probably being really stupid here, but how exactly is this code working? The way I understand it: I have my 32 objects, and when all of them are active, all bits are set. A quick look at _BitScanForward says it searches for the least significant bit that is set, so each loop iteration I find this index and use it to update my player. Do I understand correctly? Also, I understand you said this is just an alternative, but I was wondering how it differs in general from having something like this:

The loop keeps going until the mask becomes 0. Every iteration, the first bit that is set to 1 is cleared, going from least significant to most significant. The offset value is simply the position of the first set bit, again going from LSB to MSB, which gives you a value between 0 and 31 for a 32-bit mask. It's quite ingenious, I must say.
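(A short trace with a 5-bit mask, in case it helps:)

// mask = 0b01010  -> LeastSignificantBit = 1, update players[1]
// mask &= mask-1  -> 0b01000
// mask = 0b01000  -> LeastSignificantBit = 3, update players[3]
// mask &= mask-1  -> 0b00000, loop ends: 2 iterations for 2 active players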

How this differs from your example is that you have a single memory address to check for availability, instead of many addresses, one inside each object. If I'm correct, the cache will not only load the bytes at those addresses but also some bytes around them (which you might not need), quickly filling up your cache lines. Please correct me if I'm wrong.

The loop keeps going until the mask becomes 0. Every iteration, the first bit that is set to 1 is cleared, going from least significant to most significant. The offset value is simply the position of the first set bit, again going from LSB to MSB, which gives you a value between 0 and 31 for a 32-bit mask.
Yep, so this means it only loops as many times as there are bits set. If there are 2 active players, the loop body runs just twice for those 2 players, as opposed to running 32 times and failing the objects[i].use() test 30 of those times.
I use this technique quite a bit for larger collections -- e.g. if I've got 100 potential objects but usually only a handful of them need processing, then looping over four u32s and checking their bits is a huge time-saver.
For an example of taking this to the extreme, check out this InactiveArray<T> class; it groups every 64 elements of the array under a 64-bit mask, then also groups every 64 masks under another mask, and so on, forming a pyramid of masks specifying which regions of the array are active.
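(The class itself isn't reproduced here, but a cut-down two-level version of the idea might look like this; all names are invented for the sketch.)

typedef unsigned long long u64;

// Two-level active mask covering up to 64*64 = 4096 elements:
// bit g of topMask is set iff groupMask[g] has any bit set.
struct TwoLevelMask
{
    u64 topMask;
    u64 groupMask[64];

    TwoLevelMask() : topMask(0) { for(int i = 0; i < 64; ++i) groupMask[i] = 0; }

    void activate(int index)
    {
        int g = index >> 6;                      // which group of 64
        groupMask[g] |= 1ull << (index & 63);
        topMask      |= 1ull << g;               // group is now non-empty
    }

    void deactivate(int index)
    {
        int g = index >> 6;
        groupMask[g] &= ~(1ull << (index & 63));
        if(groupMask[g] == 0)
            topMask &= ~(1ull << g);             // group became empty
    }
};

// Iteration uses the same mask &= mask-1 trick on topMask first, then on
// each non-empty groupMask, skipping empty regions of the array wholesale.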

How this differs from your example is that you have a single memory address to check for availability, instead of many addresses, one inside each object. If I'm correct, the cache will not only load the bytes at those addresses but also some bytes around them (which you might not need), quickly filling up your cache lines. Please correct me if I'm wrong.
Yeah, I've seen in quite a lot of games stuff like struct Player { ...lots of members... bool active; ...lots more members... };, which means that when you want to loop through all the objects and test whether they're active, you're loading 32 cache lines that are spaced really far apart, almost guaranteeing 32 cache misses. You should at least break the active flag out into a bool active[32], and at that point a 32-bit integer is logically the same thing.
While I'm at it, what's a good profiler to use? I'm on Windows XP with an AMD Athlon X2 Dual Core 4600+.
Intel VTune is really good, but it's only an option if you've got the money to invest in it, or are unscrupulous regarding copyright... AMD has a free alternative called CodeAnalyst, and you could also try Very Sleepy and gperftools.

It also makes me wonder: if I know beforehand how many objects I will have, wouldn't I be better off creating an array as a member variable of my Player class (call it Player for simplicity) to hold the variables that are updated often? To clarify, say I have this
...
and that I create 32 instances of this struct, but most of the time I will only be updating the position variable. Wouldn't I be better off doing something like this, to fit more data that will actually be used into a cache line: ...
Yes, OOP is bad in that it teaches you that all data belonging to the same logical object should be stored in the same physical location, which can be very harmful for the cache. Look up "data-oriented design" and "pitfalls of OOP" for tips on this topic.
Generally you want to split out and group the members based upon the transformations (methods) that use them. If physics uses the objects' shape+mass+position, culling uses position+AABB, and game-logic uses position+health+energy, then you should probably split the data so that each of those three "methods" only has to download the data it is actually going to use. Taking the intersections of their requirements gives four arrays: {shape+mass}, {position}, {AABB}, {health+energy}.
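(A minimal sketch of that split; the member types are invented for illustration.)

struct Vector2 { float x, y; };
struct AABB    { Vector2 min, max; };

// Structure-of-arrays layout following the grouping above. Each system
// iterates only the arrays it needs, so unrelated data stays out of the cache.
struct World
{
    // {shape+mass}: used by physics only
    int*     shapes;
    float*   masses;
    // {position}: shared by physics, culling and game-logic
    Vector2* positions;
    // {AABB}: used by culling only
    AABB*    aabbs;
    // {health+energy}: used by game-logic only
    float*   healths;
    float*   energies;
};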
Sorting would decrease possible cache misses, but then it's a question of how much of a penalty I pay for sorting so often.

Yeah, you're talking about optimising, so profiling is a must. The sorted/unsorted case is pretty easy to test, as the logic for using your array of pointers doesn't change; the only difference is whether you sort the array of pointers beforehand or not (or perhaps whether you perform sorted insertions when activating items).
N.B. sorting small (<100,000 element) arrays is surprisingly fast, so don't be too scared of it. My quick tips for being cache-friendly: use large linear allocations, sort the data, and batch your operations so that you perform the same task on a whole group at once (no virtual void OnEvent going off to random places during a batch -- queue these up until afterwards, then sort them by callback and execute them as a batch over a group of data too).
I still have to ask, though: if, for example, three Foo objects fit in one cache line, then when reading foo[1], could foo[0] be put in that cache line as well? Or would I only be reading what comes after foo[1] (foo[2], foo[3], ...)?

Yes: if foo[1] is in the middle or at the end of a line and foo[0] is at the start of that line, then fetching foo[1] will also fetch foo[0], as the whole line must be fetched from RAM.
Apologies for the late reply, but I think I got help with everything I was wondering about. Not only do I have a better understanding of cache locality, I also got some great suggestions for solving problems that are probably common. Thanks a lot guys, very much appreciated :)

