Suen

Swapping addresses between pointers and cache locality

9 posts in this topic

Assume I have a vector that consists of pointers. When the contents of this vector are read (one element at a time, in order) I should see few cache misses, since the pointers are adjacent to each other in memory. Now, if we assume that what each pointer is POINTING at also lies in one contiguous block of memory, then from what I've understood, reads through those pointers should also produce few cache misses (corrections here are welcome). To clarify with some code:

[CODE]
Foo* foo = new Foo[size];     // one contiguous block of Foo objects
std::vector<Foo*> fooVector;
fooVector.reserve(size);

for(int i = 0; i < size; i++)
{
    fooVector.push_back(&foo[i]); // pointers stored contiguously, in array order
}
[/CODE]

Now let's assume that during run-time an event occurs, and each time it does, a pointer in the vector is processed. Some operations are done on the value pointed at by this pointer, and then a swap is done: the address held by the currently processed pointer is exchanged with the address held by the last pointer in the vector. Note that 'last' here doesn't necessarily mean the last element of the vector; some variable could keep track of what counts as the "last element". That detail isn't important, though; the point is that the addresses the pointers hold are being swapped between them. Again, some code to clarify what I mean:

[CODE]
void processPointer(int pointerId)
{
    // do some operations on the Foo object pointed at by fooVector[pointerId]

    // then swap the addresses held by the two pointers
    std::swap(fooVector[pointerId], fooVector[lastElement]);
}
[/CODE]

As far as I can tell, the next time the vector is read the pointers are still adjacent in memory, so nothing changes there. What they point at becomes unordered over time, though: the first element of the vector could point at the third element of the heap-allocated array, the third element of the vector could point at the tenth element of the array, and so on. How does this affect cache locality when the pointed-at values are read in this unordered fashion? I realize the values still belong to the same contiguous block, but a cache line doesn't hold unlimited memory.

To elaborate on that last sentence, I see it this way: say we end up in a situation where fooVector[0] points at foo[3] and fooVector[1] points at foo[10]. When I go through my vector step by step and perform calculations on the pointed-at values, processing fooVector[0] could pull foo[3], foo[4] and foo[5] into the cache (perhaps something before foo[3] as well; I honestly don't know whether it looks "behind" or not), and that might be as much as fits in one cache line. When I then process fooVector[1], foo[10] will not be in the cache, so I incur a cache miss. I realize this depends entirely on the actual size of my objects (a Foo object in this case), but this is how I thought of it.

I could be wrong about many things here; I'm fairly new to considering the cache when coding, so I may have made some wrong assumptions. If so, please let me know.
So basically you have two contiguous blocks of memory, one for the pointers and one for the objects and you are accessing the objects in an unordered fashion? That could very well lead to an increased number of cache misses, but who knows? Is there an underlying problem you're trying to solve?
[quote name='Mussi' timestamp='1347144016' post='4978116']
So basically you have two contiguous blocks of memory, one for the pointers and one for the objects and you are accessing the objects in an unordered fashion? That could very well lead to an increased number of cache misses, but who knows? Is there an underlying problem you're trying to solve?
[/quote]

Yes, it's pretty much as you said: two contiguous blocks of memory, but one of them ends up being accessed in an unordered fashion as time goes by. As for an underlying problem, I'm simply trying to avoid cache misses as much as possible; it's a goal I want to meet as well as I can.

The problem here is that my vector is supposed to hold only pointers to objects I will perform operations on in my program. When I want to stop considering an object, I get "rid" of the corresponding pointer. Instead of removing the pointer from the vector (and possibly causing a re-allocation of all remaining elements in the vector), I merely swap it with the last element and decrease a counter that keeps track of how many objects are left. This gives me two advantages: I don't constantly have to remove and add pointers to my vector, and I only access objects that must be changed; any object whose pointer I swapped out earlier won't even be processed until a specific event occurs.

The alternative would be to add/remove pointers from my vector, or even add/remove objects themselves (either of which would avoid the whole swap business), but this is not very suitable in my case: I would be doing many allocations/de-allocations in a short time (not to mention all the data that has to be moved around each time an object is removed), and that has its own problems. I posted more about this in a thread I created in the game programming section, although that thread wasn't exclusively about this issue. The way I have it now, my objects and the pointers are allocated on the heap once and only once throughout the life of the program. But yes, my current solution results in the unordered access explained above.

I could go back to my old implementation, which avoided accessing my array of objects in an unordered fashion. The disadvantage is that I then have to check a condition for ALL my objects, each pass through my main loop, to decide whether an object should be processed. For example, with 32 objects and a method called on each of them 60 times per second, that's 32*60 if-checks in a single second. And that's just for updating; I have to do the same when I render my objects (and rendering happens many more times per second than my update).

I suppose it's quite hard to know which one would affect overall performance more without doing any profiling. Ultimately I want to avoid cache misses more than branches, which sort of answers my own question as to which solution to go with, but I'd like the best of both.

Anyway, sorry to go off-topic. I was just wondering whether the unordered access of my objects would increase the number of cache misses I get.
[quote]
Yes and yes. Since you know how cache lines work, I assume that both your vector's data (the pointers) and the pointed-to data are larger than a single cache line (otherwise you are good no matter what you do, since both arrays are contiguous).
[/quote]

My vector consists of 32 pointers, and each pointer occupies 4 bytes, so 128 bytes in total. I remember reading that there are architectures with 128-byte cache lines, but it's a bit unrealistic to expect that on all PCs; I think it's better to assume a smaller size such as 32 bytes, which would require four cache lines just for the vector. As for the array that holds my objects (32 of them), each object takes up 20 bytes, 640 bytes in total. I could reduce the object size to 16 bytes by using a smaller datatype, but I've decided not to go that route before I do some profiling. So at the moment a cache line can only hold one whole object. (Quick question: since the object occupies 20 of the 32 available bytes, will the rest of the line be used for the neighboring object, or will it end up unused? The latter of course isn't good, but it's not always avoidable.) I realize we're talking about really small sizes here, and most people might say it won't make any difference performance-wise, but I have a set of goals I want to try to fulfill.

[quote]
But the real question is, unless I missed something, why do you need to swap the pointers in the first place, and how often do you process pointers and/or iterate the array?
[/quote]

See my reply above. To add to it: I currently iterate through my vector well over 60 times per second (the array itself is rarely iterated, since most updates on the objects happen while I process the pointers pointing at them; at most it is iterated once every 2-3 minutes).

[quote]
And also, how large are these arrays? The cache consists of several cache lines, so if they fit into the cache as a whole, it might not become an issue at all.
[/quote]

That should be answered in the first paragraph, if I'm not mistaken. But again, even if most of it fits in the L1 cache, I don't want to use that as an excuse to make bad design choices ;)

[quote]
Of course, if you have a real case, and this is not just hypothetical, you really should profile with a profiler that can highlight cache usage/misses.
[/quote]

Well, this really does seem to be the best choice, although I just wanted to make sure I had understood my underlying problem correctly (that I am, at least theoretically, incurring more cache misses with my current solution).
[quote]
On most implementations, decreasing the size of a vector won't cause it to reallocate - it will just reduce its internal size counter to keep track of what's left, and when you re-add an element it won't need to reallocate either.
However, this behaviour isn't specified by the standard, so a std::vector may reallocate when used like this. You could use an alternate implementation, like rde::vector, so you know what it's going to do, or you could just keep your current system and change to Foo** fooVector = new Foo*[size].
[/quote]

That's true. I think I somehow mixed up reallocation with the moving of data inside the vector :D What I really meant to say is that if I remove an element in the middle of my vector, I create a gap, and every element following the removed one is moved one step down, which calls the assignment operator (or the copy constructor, I forget which) once per moved element. That's one of the reasons I went with the swapping method.

[quote]
The solution of keeping track of 'active' objects via a list of pointers is a good solution, but for the sake of alternatives:
If you've got 32 players, you can use a 32-bit integer as a collection of booleans to tell which ones are active. To then iterate through only the active players, you can use a loop like:

[CODE]
for( u32 mask = activePlayers; mask; mask &= mask-1 ) // while bits are set: do loop body, then clear least significant bit
{
    int offset = LeastSignificantBit(mask); // implemented via _BitScanForward on MSVC
    players[offset].Update();
}
[/CODE]

Larger collections can use an array of 32-bit masks (or 64-bit masks would be optimal for x64 builds).
[/quote]

I'm probably being really stupid here, but how exactly does this code work? The way I understand it, I have my 32 objects, and when all of them are active all bits are set. A quick look at _BitScanForward says it searches for the lowest bit that is set, so each iteration I would find that index and use it to update my player. Do I understand that correctly? Also, I understand you said this is just an alternative, but I was wondering how it differs in general from having something like this:

[code]
for(int i = 0; i < 32; i++)
{
    if(object[i].use()) // if object[i] is to be processed...
    {
        object[i].update(); // ...do stuff with object[i]
    }
}
[/code]


[quote]
You can use this function to test each of your development PCs to see what their cache-line size is. A 32KiB cache split into 128B lines is probably a good guess for a lot of PCs. These days the numbers are probably larger, not smaller. [edit]Actually I just tested my 4core Q6600 from 2007, and each core has two 32KiB L1 caches (instruction/data) with 64B lines, and each pair of cores has a 4MiB L2 cache also with 64B lines.[/edit]
N.B. if your 128B vector isn't aligned to a 128B memory address (i.e. if (pointer & 0x7F) != 0), then it will be split over two cache lines, so iterating through your vector should cause at most 2 cache misses.
[/quote]

Very helpful code, thank you :) The reason I'm assuming a smaller cache (and smaller cache lines) is that my target hardware is probably worse than your 4-core Q6600. I don't have any specifications at hand, but let's say the hardware I want to test on has only one core. Perhaps it's still wrong to assume a 32-byte cache line, but that's outside my knowledge. And yes, if the vector is 128 bytes it could split over two cache lines, even more if lines are 64 or 32 bytes, but I suppose that's still quite good.

It also makes me wonder: if I know beforehand how many objects I will have, wouldn't I be better off creating an array as a member variable of my Player class (call it Player for simplicity) that holds the variables that are updated often? To clarify, say I have this:

[code]
struct Player
{
    Vector2 position; // updated often

    // some other variables...
};
[/code]

and that I create 32 instances of this struct, but most of the time I'll only be updating the position variable. Wouldn't I be better off doing something like this, to fit more data that will actually be used into a cache line:

[code]
struct Player
{
    Vector2* array = new Vector2[size]; // updated often; will hold positions for 32 objects

    // ...similar arrays for the other variables that are not used as much
};
[/code]

[quote]
A 128B cache-line will 'download' 128B of RAM. Whatever is in that RAM is downloaded. If your object only takes up half the line, then whatever is in the other half-a-line of RAM will now be in the cache.
[/quote]

Thanks! I still have to ask, though: if, for example, three Foo objects fit in one cache line, then when reading foo[1] could foo[0] be put in that cache line as well? Or would the line only contain what comes after foo[1] (foo[2], foo[3], ...)?

[quote]
If you know you're going to iterate through this whole array, you could be greedy and just prefetch the whole 5/6 lines in advance.
e.g.

for( const char* data = (char*)players, *end = data + sizeof(Player)*size; data < end; data += cacheLineSize )
    Prefetch( data );

...but, adding prefetch commands is really something that you want to do after and during profiling, and you also want to test it carefully to make sure you've not done more harm than good.

In your case, you could just sort your vector of player pointers before updating, and now you'll be iterating through your player array in linear order (which hopefully convinces the CPU to prefetch for you automatically).
[/quote]

I do agree that I should profile before getting into prefetching. I suppose the same goes for sorting: it would reduce potential cache misses, but then it's a question of how much penalty I pay for sorting so often, which is also hard to know without profiling.


I really appreciate all the help (and explanations) given here. Going by what we have so far, the next step seems to be profiling and then deciding how to change my code based on what's been discussed. While I'm at it, what's a good profiler to use? I'm running Windows XP on an AMD Athlon X2 Dual Core 4600+.
[quote name='Suen' timestamp='1347236530' post='4978435']
I'm probably being really stupid here, but how exactly does this code work? The way I understand it, I have my 32 objects, and when all of them are active all bits are set. A quick look at _BitScanForward says it searches for the lowest bit that is set, so each iteration I would find that index and use it to update my player. Do I understand that correctly? Also, I understand you said this is just an alternative, but I was wondering how it differs in general from having something like this:
[/quote]
The loop keeps running until the mask becomes 0. Every iteration, the lowest bit that is set to 1 gets cleared (that's what mask &= mask-1 does), going from least significant to most significant. The offset value is simply the position of that lowest set bit, which gives you a value between 0 and 31 for a 32-bit mask. It's quite ingenious, I must say.

How this differs from your example is that you check a single memory address for availability, instead of an address inside each object. If I'm correct, the cache will not only load the bytes at those addresses but also some bytes around them (that you might not need), quickly filling up your cache lines. Please correct me if I'm wrong.
Apologies for the late reply, but I think I got help with everything I was wondering about. Not only do I have a better understanding of cache locality, I also got some great suggestions for solving problems that are probably common. Thanks a lot, guys; very much appreciated :)