Sign in to follow this  
LemonScented

Unity Cache optimisations for beginners

Recommended Posts

I posted a question recently asking about ways to improve the performance of some fluid physics I'm working on. I was looking for improvements in the algorithm, but it was suggested to me that I should also look into optimising my code with regards to the CPU cache, and avoiding cache misses. I'm an experienced C++ programmer, but writing stuff that's as performance-critical as complex physics code is somewhat new to me, and it struck me that I don't really know where to begin with making such optimisations, so I thought I'd throw the question out there. I know the really basic general stuff, such as "try to keep stuff you'll be processing together as close as possible in memory", "store the results of calculations or pointer lookups locally if you'll be using them a few times", and I know the difference between the "good" way to iterate through a 2-dimensional array and the "bad" way, but where else can I go from here? - Is there a profiling tool available to find out which parts of my code are the worst offenders for cache-misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered) - What are the most common causes of poor cache behaviour, and how can they be rectified? - I'm led to believe that caching works best when dealing with structures that are 2^n in size. Is there an upper-limit to this size, or any other stuff I should know about that? - Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject? I'm not quite sure which are the best questions to ask about this because as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?

Share this post


Link to post
Share on other sites
Memory caches are built around two assumptions: locality of space and locality of time. The first means that if a program accesses one memory location it's very likely that it will access memory locations near by. The second means that if a program access one memory location it's likely to access that same memory location again in the near future. The more your program behavior matches these assumptions, the more efficient memory caching will be. To improve spatial locality, you can do things like using sequential memory containers rather than node based containers (ex: using std::vector rather than std::list) and use values rather than pointers when you can. To improve temporal locality you do things like processing everything you need on A before moving on to B, C ... ZZZ rather than do f(A), f(B) ... f(ZZZ) then g(A), g(B) ... g(ZZZ), and so on.

Some profilers include information about cache misses such as VTune and AMD CodeAnalyst. I don't recall if the MSVC 2008 profiler reports that information, but that's only available if you're using one of the team editions.

Share this post


Link to post
Share on other sites
Quote:
Original post by LemonScented
I'm not quite sure which are the best questions to ask about this because as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?
My understanding is that cache optimisation is largely guess work combined with rules of thumb.

You can do things like using contiguous sequences of data to maximise spatial locality and minimising the amount of jumping about you do (reduce pointers/references, reduce random access into other arrays/structs and transform recursive functions into iterative ones). Likewise if you're iterating over a sequence of data then try and perform as much of the processing per iterative step (rather than repeat iterations) to benefit from temporal locality optimisations. SiCrane descibed both of these already.

You can also try fiddling the code-wise arrangement of data members to benefit from better packing. You can also separate frequently used data members into their own contiguous blocks separate from their less frequently used contemporaries for improved spatial locality.

Share this post


Link to post
Share on other sites
The big thing is really just to be aware of the cache, in light of your data access patterns.

A typical cache line is 64 bytes in length for an X86 CPU -- what this means is that whenever you hit main memory, all of that 64-byte "page" will be in the L1 cache, giving you very fast access -- so what you want to do is cram as much of the necessary information as possible into that 64-bytes -- this requires both appropriately-sized data and appropriately aligned data. Be aware, however, that some CPUs (particularly older models or different architectures) may have different cache line sizes.

Recent Intel CPUs (Pentium D, Core, Core 2) are 8-way/16-way set associative (L1 and L2 caches respectively) which means that it has 8 entries per 64-byte "page". You may be wondering why this is, and the reason is that each of these 8-element sets aligns to several 64-byte pages in main memory -- 8-way means that the CPU caches 8 pages from main memory which align on the same cache tag. The more "ways" a CPU cache has, the less frequently that line will be evicted from the cache -- but the effectiveness of a particular caching setup can't be judged by numbers alone, much of it has to do with overall cache size, CPU and memory subsystems, latencies, and typical workloads. Just assume that your CPU manufacturer went through a lot of trouble deciding on the optimal setup.

Then, there's cache aliasing. Aliasing is the frequency with which "pages" in main memory map to sets within the cache. Taking a typical L1 cache size of 32K (recent Intel CPUs meantioned above) We divide by the "ways" of the cache, 8 in case of the L1, to arrive at 4K -- what this means is that for every 64-byte page in main memory which maps to a particular set, each page which is an integer multiple of 4096 bytes away also maps to that set, and that this holds true for the entirety of main memory installed in the system. What this means in practical terms, is that if you're accessing data aligned on these aliasing bounds more than the number of "ways" your CPU cache has before you're done using those pages, you'll pre-maturely evict the cache lines. This is sometimes called "thrashing" the cache, though that term is also applied more loosely a general mis-use of the cache.

Data structures that are a power of two in size are good -- though that's less in-and-of-itself, and more because cache line sizes are powers of two in size, the real issue is that you want your data structures to fit nicely in the cache -- If cache lines were 80 bytes, then good structure sizes would be 2, 4, 5, 8, 10, 16, 20, 40, and 80 bytes. Practically speaking, it would be very esoteric for a CPU to adopt a cache line that was not a power of two -- I simply use the example of 80 bytes as an illustration for the reasoning behind powers of two being good in real systems.

One more "advanced" technique for designing your data structures is to divide them into "hot" and "cold" data sets. This is beneficial when the structure consists of some data which is frequently accessed in performance critical areas (the hot set), and some which is rarely accessed, or accessed in non-performance-critical areas (the cold set). It works best when the "hot" data, plus an additional pointer to the cold data (or just the hot data if the cold data can be organized in a separate structure) can be made to fit the cache better. For example, say you have a structure of hot and cold data that is 10 32-bit integers, or 40 bytes -- You're only able to fit 1.6 of these structures per cache-line. If we are able to identify 3 "hot" integers, and add a 32-bit pointer to a structure containing the "cold" integers, then our hot structure is only 16 bytes -- we can now fit exactly 4 of these structures into a cache line.

I'd really recommend you read up on the details of how caches work to gain insight on how to structure your code and data access patterns. Some concepts you should try to understand in some depth are: Associativity, Aliasing, Tagging, Line-Size, Exclusive vs. Inclusive, the relationship between cache levels L1<->L2<->L3, and the relationship between the highest-level cache and main-memory.

Finally, be aware that caching setups do vary, for example, AMD Athlon X2 CPUs have 64k, 2-Way set associative L1 caches. The more general rules-of-thumb mentioned above (spacial and temporal locality of reference) ought to give good results across all platforms -- I wouldn't attempt to optimize aggressively for one particular caching setup unless your platform is predictable (console, handheld, embedded system development), or you are optimizing towards a range of CPUs that have similar caching setups.

Share this post


Link to post
Share on other sites
Always optimize by choosing better algorithms, first. It's only when you KNOW you have the fastest possible algorithms in place, that you begin to worry about this stuff.

There are two stages to this process. One is done by touch and feel - put your data close together (aka, arrays for many things), and put the references to the data close together (avoid hopping all over the array where possible). Sort data by frequency of reference instead of other keys (experimentally: it doesn't always help).

Rethink memory allocation - try to allocate what you need up front, and avoid a lot of free/allocate cycles, as that's a cache-killer. Order stack variables in a function from most used to least used, but push large arrays to the bottom. Don't recurse when you can iterate. Arrange the data elements on structs and classes in decreasing size order (doubles and long long at the top, then int, then short, and so on) to get tight but aligned packing, in the hopes of fitting it all in cache. The previous poster mentioned segregating hot and cold data - this can be a huge win, but make very sure you know which is cold or you will make things much worse. Use unsigned char or short when you need to store small integers; don't use double if float will do. Fuss with inlining; you don't want to be constantly hopping from function to function, but if you overdo it you'll end up with functions that don't fit in cache, and hurt yourself. Read the assembler listings - optimizers DO make mistakes and sometimes rearranging code will do impressive shrinks in the number of instructions needed.

It's all small stuff, but it can add up to a measurable improvement. Measure after every change.

Sometimes you do counterintuitive things. I remember AMD recommending doing tricks with asynch reads to "warm the processor cache ahead of use"; it worked, but it was so ugly I wished it didn't.

If that doesn't get you what you need, the next stage is specialized tools for your environment, and it gets hard. Hope you never have to go there. Sometimes it's just cheaper to move up in hardware, if you have that choice.

Share this post


Link to post
Share on other sites
Quote:
Original post by LemonScented
fluid physics


Since this is one of the problems people have been solving for a very long time, it may make sense to look into an existing library. In general, linear algebra and related topics have been optimized to oblivion already.

Quote:
- Is there a profiling tool available to find out which parts of my code are the worst offenders for cache-misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered)


AMD CodeAnalyst and Intel VTune.

Quote:
- I'm led to believe that caching works best when dealing with structures that are 2^n in size.


In general no. The unit of measure is cache line, 64 bytes. Alignment however does matter, which is why structures are often padded or designed to fit nicely into cache lines.

Quote:
- Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject?


I found this overview (PPT) to cover the problem domain well.

Share this post


Link to post
Share on other sites
1 - before doing anything, profile
2- after doing anything, profile
3- only do one thing at once
4- only fix things that show up in the profile.

so, you may not actually have a problem at this point, check your profile output first.
the profile will also show up exactly what type of cache problems you have (if you have any).

algorithmic changes will work far better than small instruction based changes.

data layout and access patterns will be the biggest target area for cache problems.

Matt D

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this  

  • Announcements

  • Forum Statistics

    • Total Topics
      628300
    • Total Posts
      2981894
  • Similar Content

    • By CpawsMusic
      Hey there! My name is Cpaws and I compose music, sound effects and overall sound design for video games and film. I'm looking to work on a horror game mainly for fun and as an addition to my portfolio. I've used Unity & Wwise before for audio implementation.
      If you're interested in working together, don't hesitate to contact me at cpawsmusic@gmail.com
      Here's a few demo reels of my past projects: 
      Here's some snippets of some quick game music I've composed: https://soundcloud.com/cpawsmusic/sets/cpaws-video-game-film-music
      Here's my portfolio/website: https://CpawsMusic.com/
      E-mail: cpawsmusic@gmail.com
    • By muhamad rabee
      My first mobile game made by unity

      iphone: https://itunes.apple.com/us/app/aa-countdown/id1314223584?ls=1&mt=8

      android: https://play.google.com/store/apps/details?id=com.mr.AACountDown

      I appreciate every suggestion
    • By Simplepg
      https://play.google.com/store/apps/details?id=simple.gplay.GamesInABox
    • By Simplepg
      https://play.google.com/store/apps/details?id=simple.gplay.GamesInABox
    • By ForgedInteractive


      Who We Are
      We are Forged Interactive, a small team of like-minded game developers with the sole purpose of making games we love! We're a team of artists, animators, programmers, level designers, writers, composers, producers, and other creative minds. We want to make games that you, the modern gamer want to play! We hope to build a community that enjoys our games as much as we love creating them. With your feedback and support we will be able to achieve that.

      About the Game
      GAME NAME is a fun, action-packed army builder with unique characters, challenges and engaging levels. Set forth on an adventure to protect friends, family and countrymen from new adversaries. Once defeated your enemies turn coat and join you in your adventures. Players can enjoy a range of troops and abilities based on their gameplay style which become more important as maps introduce more challenging terrain, enemies and bosses. Strong orc knights, dangerous shamans, and even a dragon are out on the prowl. Knowing when to fight and when to run, and how to manage your army is essential. Your actions alone decide the fate of this world.

      Previous Work by Team
      Although we are working towards our first game as a team, our team members themselves have past experience in the industry.
      This includes members who have worked on titles including:
      Final Fantasy Kingsglaive, FIFA, Xcom 2 and Civilization.

      Who are we looking for? 3D Modellers Concept Artists Marketing Specialists Level Designer

      What do we expect? Reference work or portfolio. Examples what have you already done and what projects you have worked on academic or otherwise. The ability to commit to the project on a regular basis. If you are going on a two-week trip, we don't mind, but it would be good if you could commit 10+ hours to the project each week. Willingness to work with a royalty based compensation model, you will be paid when the game launches. Openness to learning new tools and techniques
      What can we offer? Continuous support and availability from our side. You have the ability to give design input, and creative say in the development of the game. Shown in credits on websites, in-game and more. Insight and contacts from within the Industry.
      Contact
      If you are interested in knowing more or joining. Please email or PM us on Skype. Myself or Colin will reply to you within 48 hours.

      E-mail: Recruitment@ForgedInteractive.com
      Skype: ForgedInteractive

      Regards,
      David and Colin

      Follow us on:

      Facebook: https://www.facebook.com/ForgedInteractive/
      Twitter: @ForgedInteract
      Youtube: https://www.youtube.com/channel/UCpK..._as=subscriber
      Reddit: https://www.reddit.com/user/Forged_Interactive/
  • Popular Now