How to avoid cache misses when data required is far away

Started by aganm · 11 comments, last by dmatter 4 years, 7 months ago

I have a list of entities and components that I iterate over in the game loop. More often than not, those entities are NPCs that can fire at each other from a distance. First, the overseeing AI assigns each NPC the most appropriate target. Then I iterate over each NPC to run its firing logic against that target. Can you see the cache misses right here?

In pseudo code:


foreach (entity in all_entities) {    // accessed linearly
    // target is another entity in the same linear structure, but it could
    // be anywhere: from the next entity in the list to one a couple of
    // megabytes away in RAM
    entity.fire_at(entity.target)
}


 

How do I rid my code of such cache misses? Most of my logic relies on accessing random memory addresses like this and I've been struggling to refactor.


Questions:

  1. Is there actually a problem (i.e. are you getting performance problems?)?
  2. Are you sure this is causing the problem (have you profiled?)?

 

1 hour ago, ChaosEngine said:

Questions:

  1. Is there actually a problem (i.e. are you getting performance problems?)?
  2. Are you sure this is causing the problem (have you profiled?)?

 

I do. Either I reduce the scale of the game, or I fix my cache misses.

For profiling, I used perf on Linux:


 perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./game

Result


     1,218,347,798      cache-references:u
       587,238,750      cache-misses:u            #   48.200 % of all cache refs
   162,551,053,111      cycles:u
   179,079,789,818      instructions:u            #    1.10  insn per cycle
    23,451,271,748      branches:u
           211,403      faults:u
                 0      migrations:u

      70.633332501 seconds time elapsed

      50.257961000 seconds user
       4.669893000 seconds sys

48.2% of all cache refs are misses, this seems incredibly high.

Note that I compiled with highest level of compiler optimisation.

Missing information: how much does the loop you posted contribute to those 48.2% cache misses?

8 hours ago, aganm said:

How do I rid my code of such cache misses?

I don't think it is really possible to bring "cache order" to a more or less random set of data. You could maybe sort the data a little, e.g. if a group of entities fires at each other, but that introduces sorting costs and performance might get even worse. It is also hard to do cache optimizations with so little information about the involved classes and their interactions.

The best thing you can do is always to keep related data close in memory. If it is not possible, it is simply not possible.

 

1. What does fire_at() do?

Does it access the target object extensively?

If yes, can you reduce the accessed data to something small? I would imagine firing at an actor can be reduced to firing at a scene-component-style object whose size is just a handful of bytes.

2. What is the "target"?

If the target is a remote actor and there's only a handful of targets per firing object, then can you split off an above-mentioned scene component style chunk and copy it locally to the actor that is firing? (note that this is kinda moot without answering point 3)

3. Are you dealing with actual scale or trying to optimize the wrong thing?

If yes to either, can you time-slice your logic?

Case in point: Starcraft (or any RTS game for that matter, though Starcraft is a really nice example because it is extremely competitive and therefore needs to aim its requirements at machines that run on potato batteries with presumably weak hardware) supports armies of up to 200 units (I'm rounding here) per team. In a minimalist 1-on-1 game you can therefore have up to 200 zerglings or whatnot per team that need to find their paths relatively intelligently on a map that may be fairly large. Things get much worse when you have 3, 4 or 8 teams.

Now, pathfinding is an inherently cache-unfriendly problem, which is why you cannot brute-force your way through this. You cannot take 200 zerglings and find paths for all of them in one timeslice (tick).

Therefore, any modern RTS game likely employs any number of tricks from path caching to multi-tier pathfinding to simplify this problem. But there is a simpler way: spread out your work load.

You can do this in two ways: laterally (on different threads) or temporally (time-slicing).

Does every object in your game fire at one or more enemy every tick? Probably (hopefully) not.

Here's where you could cheat! Unless you can substantially improve cache coherence (as suggested in points 1 and 2) or come up with a clever way to set up your data completely differently, just ignore the cache issue and solve the problem architecturally.

Option A: create a task system and assume there's at least one other thread that can take 30-50% of the load off the main thread. This is actually fairly easy if you synchronize your data at regular intervals and work on thread-local data sets.
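Option A could look something like this minimal sketch (a made-up `process_parallel` helper, not a real task system): each thread works on its own half of the data and writes into a thread-local buffer, so no locks are needed during the loop, and the buffers are merged at a sync point.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Split a batch of work across the main thread and one worker thread.
// Each writes into its own local output buffer; merging happens only
// after the join, which is the synchronization point.
std::vector<int> process_parallel(const std::vector<int>& work) {
    size_t mid = work.size() / 2;
    std::vector<int> out_a, out_b;

    auto run = [&work](size_t begin, size_t end, std::vector<int>& out) {
        for (size_t i = begin; i < end; ++i)
            out.push_back(work[i] * 2);   // stand-in for per-entity logic
    };

    std::thread worker(run, size_t{0}, mid, std::ref(out_a));
    run(mid, work.size(), out_b);         // main thread takes the second half
    worker.join();                        // sync point

    out_a.insert(out_a.end(), out_b.begin(), out_b.end());
    return out_a;
}
```

A real task system would use a pool and chunked work items, but the "thread-local data, synchronize at intervals" shape is the same.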

Or, better yet, option B: place your firing squad in a priority queue. Assume that every tick you have up to N milliseconds to spend on logic and work through that queue as you do right now. When you run out of time, simply finish the tick and keep processing the queue during the next tick. Unless you fill up a gigantic queue (we're talking millions of objects firing at any number of things) that brings a modern CPU core to its knees for more than a full second, all you'll see are very slight potential delays before a shot is fired here or there. Chances are that in a real-world scenario parts of your unit logic will "lag" a bit, but that remains unnoticeable to the player since, hopefully, they're busy, you know, playing :D

 

 

1 hour ago, irreversible said:

1. What does fire_at() do?

2. What is the "target"?

3. Are you dealing with actual scale or trying to optimize the wrong thing?

If yes to either, can you time-slice your logic?

2. The target is another entity. My entities are laid out linearly in memory, so if entity #19 fires at #150, fire_at() has to jump back and forth between the two.

1. fire_at() fetches the target's position, then spawns a projectile object flying towards it.


Array all_entities;

foreach (me in all_entities) {
    me.fire_at(me.target) {
        float2 origin      = this.pos;
        float2 destination = target->position;  // cache miss: target can be anywhere

        // push_back must update the array's internal bookkeeping (size,
        // capacity), which may live far in memory from the entity being
        // iterated -- another potential cache miss
        all_entities.push_back(
            // the array also has to write to its end, which may be far
            // from the iterated entity in memory
            Bullet(origin, destination));
    }
}

The spawning of a new object might sound worse than fetching a position, but both are problematic for the same reason: cache misses. The position fetch misses because I access the target's position directly, which is probably far away in memory. Maybe I could cache the target's position close to the shooter, but how? If I just make a copy of the target's position, I still get a cache miss when making the copy. Or I could copy it only once in a while, but then it's no longer accurate because the target could have moved by then.
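One way to attack both misses in that loop (a hypothetical sketch, with a made-up `Entity` layout and `fire_all` helper, not the poster's actual code): gather every position into a dense snapshot once per frame in a single linear pass, then let the firing loop read only the snapshot and append bullets to a pre-reserved buffer. The snapshot is at most one frame stale, which is usually acceptable for aiming.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float2 { float x, y; };
struct Bullet { Float2 origin, destination; };

// Stand-in for a fat entity: the firing logic only needs pos and target,
// but random jumps through the full array drag cold bytes through cache.
struct Entity {
    Float2      pos;
    std::size_t target;     // index of the target entity
    char        cold[200];  // everything else an entity carries
};

std::vector<Bullet> fire_all(const std::vector<Entity>& entities) {
    // Pass 1: linear, cache-friendly copy of just the positions.
    std::vector<Float2> snapshot;
    snapshot.reserve(entities.size());
    for (const Entity& e : entities)
        snapshot.push_back(e.pos);

    // Pass 2: entities are still walked linearly; the only random access
    // is into the compact snapshot, and bullets go into a buffer that
    // never reallocates mid-loop.
    std::vector<Bullet> bullets;
    bullets.reserve(entities.size());
    for (std::size_t i = 0; i < entities.size(); ++i)
        bullets.push_back({snapshot[i], snapshot[entities[i].target]});
    return bullets;
}
```

The copy in pass 1 does touch every entity, but it does so sequentially, which is the access pattern prefetchers handle best; the random lookups in pass 2 then stay inside a few kilobytes.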

3. I am dealing with actual scale. I have implemented some time slicing, but it's not enough. Really I wish to understand how I can organize my data layout better. I feel like I missed something.

2 hours ago, aganm said:

all_entities.push_back(

Do you reserve some space for all_entities?

I assume what drops your performance is the reallocation and copying that happens without your having much control over it.

What happens if you remove the push but do something else to ensure both entities' memory is still accessed (the compiler might optimize the whole function call away if you do nothing)? Then you would see whether reading the data is the problem or writing it.

I also assume you have no custom container using pooled memory or something similar to prevent frequent reallocations? Reallocations are performance killers; cache misses may be a minor issue in comparison.
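The reserve point is easy to demonstrate. In this small sketch (hypothetical `stays_in_place` helper), reserving capacity up front means push_back never reallocates, so the buffer address stays stable and no copying of existing elements happens mid-loop:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Push `count` elements into a vector that reserved space up front and
// report whether the underlying buffer ever moved (it shouldn't).
bool stays_in_place(std::size_t count) {
    std::vector<int> v;
    v.reserve(count);                 // one allocation, done before the loop
    const int* before = v.data();
    for (std::size_t i = 0; i < count; ++i)
        v.push_back(static_cast<int>(i));
    return v.data() == before;        // no reallocation occurred
}
```

Without the reserve, a vector growing from empty typically reallocates and copies its contents several times over the same number of pushes.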

You're getting 1.1 instructions executed per clock cycle, and griping about it?

You're spending 23% of your CPU time idle, and you say you're having a performance problem?

 

What are your actual per-frame times (not frame rates), in microseconds?

And when people tell you to profile your code, we mean in such a way you get a dump giving the total time spent in each function/method invoked during the program run, not just a single total time spent for the entire program run. That's how you identify bottlenecks. Counting cache misses is something you do on particular parts of your code, after you identify which parts are slow.

11 minutes ago, Wyrframe said:

You're getting 1.1 instructions executed per clock cycle, and griping about it?

You're spending 23% of your CPU time idle, and you say you're having a performance problem?

 

What are your actual per-frame times (not frame rates), in microseconds?

And when people tell you to profile your code, we mean in such a way you get a dump giving the total time spent in each function/method invoked during the program run, not just a single total time spent for the entire program run. That's how you identify bottlenecks. Counting cache misses is something you do on particular parts of your code, after you identify which parts are slow.

Isn't 1.1 out of a possible 4 instructions per cycle bad?

In a heavy-load benchmark scenario, my frametime is ~90 ms on a 3570K. In regular gameplay, ~7 ms. The game still has many more features to add: if I get these numbers with only 30% of the gameplay logic implemented, I could very well end up near 30 ms with the completed game. My code jumps all over memory to access data, and I'm struggling to understand how to arrange my data for good cache locality because of the complexity. Take my first example, where one entity shoots at another: I can't figure out how a random entity can read another random entity's position without automatically causing a cache miss. Maybe it's impossible; that's what I'm trying to figure out.

This topic is closed to new replies.
