
Unity Cache optimisations for beginners


I posted a question recently asking about ways to improve the performance of some fluid physics I'm working on. I was looking for improvements in the algorithm, but it was suggested to me that I should also look into optimising my code with regard to the CPU cache, and avoiding cache misses. I'm an experienced C++ programmer, but writing stuff that's as performance-critical as complex physics code is somewhat new to me, and it struck me that I don't really know where to begin with making such optimisations, so I thought I'd throw the question out there.

I know the really basic general stuff, such as "try to keep stuff you'll be processing together as close as possible in memory" and "store the results of calculations or pointer lookups locally if you'll be using them a few times", and I know the difference between the "good" way to iterate through a 2-dimensional array and the "bad" way, but where else can I go from here?

- Is there a profiling tool available to find out which parts of my code are the worst offenders for cache misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered.)
- What are the most common causes of poor cache behaviour, and how can they be rectified?
- I'm led to believe that caching works best when dealing with structures that are 2^n in size. Is there an upper limit to this size, or any other stuff I should know about that?
- Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject?

I'm not quite sure which are the best questions to ask about this because, as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?

Memory caches are built around two assumptions: locality in space and locality in time. The first means that if a program accesses one memory location, it's very likely to access memory locations nearby. The second means that if a program accesses one memory location, it's likely to access that same location again in the near future. The more your program's behaviour matches these assumptions, the more effective memory caching will be. To improve spatial locality, you can do things like use sequential memory containers rather than node-based containers (ex: std::vector rather than std::list) and use values rather than pointers when you can. To improve temporal locality, you do things like process everything you need on A before moving on to B, C ... ZZZ, rather than doing f(A), f(B) ... f(ZZZ), then g(A), g(B) ... g(ZZZ), and so on.
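As a concrete sketch of both ideas (the function names and values are illustrative, not from the post above):

```cpp
#include <vector>

// Spatial locality: contiguous storage means each cache line fetched
// from main memory brings several neighbouring elements along with it.
double sum_contiguous(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;   // sequential walk, cache-friendly
    return s;
}

double f(double x) { return x * 2.0; }
double g(double x) { return x + 1.0; }

// Temporal locality: do all the work on each element in one pass
// (f then g), instead of two passes that re-fetch the whole array.
double fused_pass(std::vector<double>& v) {
    double s = 0.0;
    for (double& x : v) {
        x = g(f(x));   // element is still in cache for the second step
        s += x;
    }
    return s;
}
```

The fused loop touches each element once while it is hot in cache, where two separate f-then-g loops over a large array would stream the whole array through the cache twice.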

Some profilers include information about cache misses such as VTune and AMD CodeAnalyst. I don't recall if the MSVC 2008 profiler reports that information, but that's only available if you're using one of the team editions.

Quote:
Original post by LemonScented
I'm not quite sure which are the best questions to ask about this because as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?
My understanding is that cache optimisation is largely guesswork combined with rules of thumb.

You can do things like using contiguous sequences of data to maximise spatial locality, and minimising the amount of jumping about you do (reduce pointers/references, reduce random access into other arrays/structs, and transform recursive functions into iterative ones). Likewise, if you're iterating over a sequence of data, try to perform as much of the processing as possible per iterative step (rather than repeating iterations) to benefit from temporal locality. SiCrane described both of these already.

You can also try fiddling with the in-code arrangement of data members to benefit from better packing. You can also separate frequently used data members into their own contiguous blocks, apart from their less frequently used contemporaries, for improved spatial locality.

The big thing is really just to be aware of the cache, in light of your data access patterns.

A typical cache line is 64 bytes in length for an x86 CPU. What this means is that whenever you hit main memory, the entire 64-byte line containing that address is loaded into the L1 cache, giving you very fast access to it afterwards. So what you want to do is cram as much of the necessary information as possible into those 64 bytes -- this requires both appropriately-sized and appropriately-aligned data. Be aware, however, that some CPUs (particularly older models or different architectures) may have different cache line sizes.

Recent Intel CPUs (Pentium D, Core, Core 2) have 8-way and 16-way set associative caches (L1 and L2 respectively), which means each set holds 8 (or 16) cache lines. You may be wondering why this is: many 64-byte lines in main memory map to the same set, and 8-way means the cache can hold 8 such lines at once before one must be evicted. The more "ways" a CPU cache has, the less frequently a given line will be evicted from the cache -- but the effectiveness of a particular caching setup can't be judged by numbers alone; much of it has to do with overall cache size, CPU and memory subsystems, latencies, and typical workloads. Just assume that your CPU manufacturer went through a lot of trouble deciding on the optimal setup.

Then there's cache aliasing. Aliasing describes how "pages" in main memory map onto sets within the cache. Taking a typical L1 cache size of 32K (the recent Intel CPUs mentioned above) and dividing by the number of "ways" (8 for this L1), we arrive at 4K. What this means is that for every 64-byte line in main memory which maps to a particular set, every line an integer multiple of 4096 bytes away also maps to that set, and this holds true across the entirety of main memory installed in the system. In practical terms, if you access data aligned on these aliasing bounds more times than your cache has "ways" before you're done using those lines, you'll prematurely evict cache lines. This is sometimes called "thrashing" the cache, though that term is also applied more loosely to general misuse of the cache.
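The set-mapping arithmetic described above can be sketched as follows (using the 32K / 8-way / 64-byte figures given for that Intel L1; `cache_set` is an illustrative helper, not a real API):

```cpp
#include <cstdint>

// 32 KB cache / 8 ways / 64-byte lines = 64 sets, so addresses
// 64 * 64 = 4096 bytes apart land in the same set and compete
// for the same 8 ways.
constexpr std::uint64_t kLineSize = 64;
constexpr std::uint64_t kNumSets  = 64;

constexpr std::uint64_t cache_set(std::uint64_t addr) {
    return (addr / kLineSize) % kNumSets;
}
```

So touching more than 8 addresses that are all multiples of 4096 apart, before you're done with them, guarantees evictions in this configuration.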

Data structures that are a power of two in size are good -- though less in and of itself, and more because cache line sizes are powers of two; the real issue is that you want your data structures to fit nicely in the cache. If cache lines were 80 bytes, then good structure sizes would be 2, 4, 5, 8, 10, 16, 20, 40, and 80 bytes. Practically speaking, it would be very esoteric for a CPU to adopt a cache line size that was not a power of two -- I simply use the 80-byte example to illustrate the reasoning behind powers of two being good in real systems.

One more "advanced" technique for designing your data structures is to divide them into "hot" and "cold" data sets. This is beneficial when the structure consists of some data which is frequently accessed in performance critical areas (the hot set), and some which is rarely accessed, or accessed in non-performance-critical areas (the cold set). It works best when the "hot" data, plus an additional pointer to the cold data (or just the hot data if the cold data can be organized in a separate structure) can be made to fit the cache better. For example, say you have a structure of hot and cold data that is 10 32-bit integers, or 40 bytes -- You're only able to fit 1.6 of these structures per cache-line. If we are able to identify 3 "hot" integers, and add a 32-bit pointer to a structure containing the "cold" integers, then our hot structure is only 16 bytes -- we can now fit exactly 4 of these structures into a cache line.
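A sketch of the 40-byte example above (the names and the particular split are illustrative; exact sizes depend on the target's pointer size and alignment rules):

```cpp
#include <cstdint>

// The original structure: ten 32-bit integers, 40 bytes,
// so only 1.6 of them fit per 64-byte cache line.
struct Monolithic {
    std::int32_t v[10];
};

// Rarely-accessed members moved out into a "cold" structure...
struct Cold {
    std::int32_t v[7];
};

// ...leaving a "hot" structure of three integers plus a pointer to
// the cold data: 16 bytes on a 32-bit target (exactly 4 per line),
// 24 bytes on a 64-bit target -- either way a better fit than 40.
struct Hot {
    std::int32_t v[3];
    Cold*        cold;
};
```

The win comes from the performance-critical loops now streaming over densely packed hot data, with the cold data only pulled into cache on the rare paths that need it.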

I'd really recommend you read up on the details of how caches work to gain insight on how to structure your code and data access patterns. Some concepts you should try to understand in some depth are: Associativity, Aliasing, Tagging, Line-Size, Exclusive vs. Inclusive, the relationship between cache levels L1<->L2<->L3, and the relationship between the highest-level cache and main-memory.

Finally, be aware that caching setups do vary; for example, AMD Athlon X2 CPUs have 64K, 2-way set associative L1 caches. The more general rules of thumb mentioned above (spatial and temporal locality of reference) ought to give good results across all platforms -- I wouldn't attempt to optimize aggressively for one particular caching setup unless your platform is predictable (console, handheld, embedded system development), or you are optimizing towards a range of CPUs that have similar caching setups.

Always optimize by choosing better algorithms, first. It's only when you KNOW you have the fastest possible algorithms in place, that you begin to worry about this stuff.

There are two stages to this process. One is done by touch and feel - put your data close together (aka, arrays for many things), and put the references to the data close together (avoid hopping all over the array where possible). Sort data by frequency of reference instead of other keys (experimentally: it doesn't always help).

Rethink memory allocation -- try to allocate what you need up front, and avoid a lot of free/allocate cycles, as that's a cache-killer. Beyond that:

- Order stack variables in a function from most used to least used, but push large arrays to the bottom.
- Don't recurse when you can iterate.
- Arrange the data members of structs and classes in decreasing size order (doubles and long longs at the top, then ints, then shorts, and so on) to get tight but aligned packing, in the hope of fitting it all in cache.
- The previous poster mentioned segregating hot and cold data -- this can be a huge win, but be very sure you know which is cold or you will make things much worse.
- Use unsigned char or short when you need to store small integers; don't use double if float will do.
- Fuss with inlining: you don't want to be constantly hopping from function to function, but if you overdo it you'll end up with functions that don't fit in cache, and hurt yourself.
- Read the assembler listings -- optimizers DO make mistakes, and sometimes rearranging code will produce impressive shrinks in the number of instructions needed.
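A minimal sketch of the decreasing-size-order tip (member names are made up; exact padding depends on the compiler's alignment rules, though the relative saving shown in the comments is typical):

```cpp
#include <cstdint>

// Mixed sizes in declaration order: the compiler inserts padding
// before each member to satisfy its alignment.
struct Loose {
    std::uint8_t  a;   // 1 byte, then padding before the double
    double        b;   // 8 bytes
    std::uint8_t  c;   // 1 byte, then padding before the uint32
    std::uint32_t d;   // 4 bytes
};                     // typically 24 bytes

// Same members, declared largest-first: far less padding.
struct Tight {
    double        b;   // 8 bytes
    std::uint32_t d;   // 4 bytes
    std::uint8_t  a;   // 1 byte
    std::uint8_t  c;   // 1 byte, then tail padding
};                     // typically 16 bytes
```

Same data, noticeably smaller footprint, so more instances fit per cache line.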

It's all small stuff, but it can add up to a measurable improvement. Measure after every change.

Sometimes you do counterintuitive things. I remember AMD recommending doing tricks with asynch reads to "warm the processor cache ahead of use"; it worked, but it was so ugly I wished it didn't.

If that doesn't get you what you need, the next stage is specialized tools for your environment, and it gets hard. Hope you never have to go there. Sometimes it's just cheaper to move up in hardware, if you have that choice.

Quote:
Original post by LemonScented
fluid physics


Since this is one of the problems people have been solving for a very long time, it may make sense to look into an existing library. In general, linear algebra and related topics have been optimized to oblivion already.

Quote:
- Is there a profiling tool available to find out which parts of my code are the worst offenders for cache-misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered)


AMD CodeAnalyst and Intel VTune.

Quote:
- I'm led to believe that caching works best when dealing with structures that are 2^n in size.


In general, no. The unit of measure is the cache line, typically 64 bytes. Alignment does matter, however, which is why structures are often padded or designed to fit neatly into cache lines.

Quote:
- Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject?


I found this overview (PPT) to cover the problem domain well.

If you know your access patterns then you should prefetch your data before using it.

E.g., with the _mm_prefetch intrinsic from <xmmintrin.h>, fetching one cache line ahead of the element currently being processed:

#include <xmmintrin.h>

for (unsigned int i = 0; i < abignumber; ++i) {
    // Hint the line CACHE_LINE_SIZE bytes ahead into all cache levels.
    _mm_prefetch(reinterpret_cast<const char*>(&myarray[i]) + CACHE_LINE_SIZE, _MM_HINT_T0);
    process(myarray[i]);
}

1. Before doing anything, profile.
2. After doing anything, profile.
3. Only do one thing at a time.
4. Only fix things that show up in the profile.

So you may not actually have a problem at this point - check your profiler output first. The profile will also show exactly what type of cache problems you have (if you have any).

Algorithmic changes will work far better than small instruction-level changes.

Data layout and access patterns will be the biggest target area for cache problems.

Matt D
