Jump to content
  • Advertisement
Sign in to follow this  
LemonScented

Unity Cache optimisations for beginners

This topic is 3251 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I posted a question recently asking about ways to improve the performance of some fluid physics I'm working on. I was looking for improvements in the algorithm, but it was suggested to me that I should also look into optimising my code with regards to the CPU cache, and avoiding cache misses. I'm an experienced C++ programmer, but writing stuff that's as performance-critical as complex physics code is somewhat new to me, and it struck me that I don't really know where to begin with making such optimisations, so I thought I'd throw the question out there. I know the really basic general stuff, such as "try to keep stuff you'll be processing together as close as possible in memory", "store the results of calculations or pointer lookups locally if you'll be using them a few times", and I know the difference between the "good" way to iterate through a 2-dimensional array and the "bad" way, but where else can I go from here? - Is there a profiling tool available to find out which parts of my code are the worst offenders for cache-misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered) - What are the most common causes of poor cache behaviour, and how can they be rectified? - I'm led to believe that caching works best when dealing with structures that are 2^n in size. Is there an upper-limit to this size, or any other stuff I should know about that? - Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject? I'm not quite sure which are the best questions to ask about this because as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?

Share this post


Link to post
Share on other sites
Advertisement
Memory caches are built around two assumptions: locality of space and locality of time. The first means that if a program accesses one memory location it's very likely that it will access memory locations near by. The second means that if a program access one memory location it's likely to access that same memory location again in the near future. The more your program behavior matches these assumptions, the more efficient memory caching will be. To improve spatial locality, you can do things like using sequential memory containers rather than node based containers (ex: using std::vector rather than std::list) and use values rather than pointers when you can. To improve temporal locality you do things like processing everything you need on A before moving on to B, C ... ZZZ rather than do f(A), f(B) ... f(ZZZ) then g(A), g(B) ... g(ZZZ), and so on.

Some profilers include information about cache misses such as VTune and AMD CodeAnalyst. I don't recall if the MSVC 2008 profiler reports that information, but that's only available if you're using one of the team editions.

Share this post


Link to post
Share on other sites
Quote:
Original post by LemonScented
I'm not quite sure which are the best questions to ask about this because as I say, beyond the "basics" it's not easy to know which avenues to explore. Anyone got any thoughts or advice?
My understanding is that cache optimisation is largely guess work combined with rules of thumb.

You can do things like using contiguous sequences of data to maximise spatial locality and minimising the amount of jumping about you do (reduce pointers/references, reduce random access into other arrays/structs and transform recursive functions into iterative ones). Likewise if you're iterating over a sequence of data then try and perform as much of the processing per iterative step (rather than repeat iterations) to benefit from temporal locality optimisations. SiCrane descibed both of these already.

You can also try fiddling the code-wise arrangement of data members to benefit from better packing. You can also separate frequently used data members into their own contiguous blocks separate from their less frequently used contemporaries for improved spatial locality.

Share this post


Link to post
Share on other sites
The big thing is really just to be aware of the cache, in light of your data access patterns.

A typical cache line is 64 bytes in length for an X86 CPU -- what this means is that whenever you hit main memory, all of that 64-byte "page" will be in the L1 cache, giving you very fast access -- so what you want to do is cram as much of the necessary information as possible into that 64-bytes -- this requires both appropriately-sized data and appropriately aligned data. Be aware, however, that some CPUs (particularly older models or different architectures) may have different cache line sizes.

Recent Intel CPUs (Pentium D, Core, Core 2) are 8-way/16-way set associative (L1 and L2 caches respectively) which means that it has 8 entries per 64-byte "page". You may be wondering why this is, and the reason is that each of these 8-element sets aligns to several 64-byte pages in main memory -- 8-way means that the CPU caches 8 pages from main memory which align on the same cache tag. The more "ways" a CPU cache has, the less frequently that line will be evicted from the cache -- but the effectiveness of a particular caching setup can't be judged by numbers alone, much of it has to do with overall cache size, CPU and memory subsystems, latencies, and typical workloads. Just assume that your CPU manufacturer went through a lot of trouble deciding on the optimal setup.

Then, there's cache aliasing. Aliasing is the frequency with which "pages" in main memory map to sets within the cache. Taking a typical L1 cache size of 32K (recent Intel CPUs meantioned above) We divide by the "ways" of the cache, 8 in case of the L1, to arrive at 4K -- what this means is that for every 64-byte page in main memory which maps to a particular set, each page which is an integer multiple of 4096 bytes away also maps to that set, and that this holds true for the entirety of main memory installed in the system. What this means in practical terms, is that if you're accessing data aligned on these aliasing bounds more than the number of "ways" your CPU cache has before you're done using those pages, you'll pre-maturely evict the cache lines. This is sometimes called "thrashing" the cache, though that term is also applied more loosely a general mis-use of the cache.

Data structures that are a power of two in size are good -- though that's less in-and-of-itself, and more because cache line sizes are powers of two in size, the real issue is that you want your data structures to fit nicely in the cache -- If cache lines were 80 bytes, then good structure sizes would be 2, 4, 5, 8, 10, 16, 20, 40, and 80 bytes. Practically speaking, it would be very esoteric for a CPU to adopt a cache line that was not a power of two -- I simply use the example of 80 bytes as an illustration for the reasoning behind powers of two being good in real systems.

One more "advanced" technique for designing your data structures is to divide them into "hot" and "cold" data sets. This is beneficial when the structure consists of some data which is frequently accessed in performance critical areas (the hot set), and some which is rarely accessed, or accessed in non-performance-critical areas (the cold set). It works best when the "hot" data, plus an additional pointer to the cold data (or just the hot data if the cold data can be organized in a separate structure) can be made to fit the cache better. For example, say you have a structure of hot and cold data that is 10 32-bit integers, or 40 bytes -- You're only able to fit 1.6 of these structures per cache-line. If we are able to identify 3 "hot" integers, and add a 32-bit pointer to a structure containing the "cold" integers, then our hot structure is only 16 bytes -- we can now fit exactly 4 of these structures into a cache line.

I'd really recommend you read up on the details of how caches work to gain insight on how to structure your code and data access patterns. Some concepts you should try to understand in some depth are: Associativity, Aliasing, Tagging, Line-Size, Exclusive vs. Inclusive, the relationship between cache levels L1<->L2<->L3, and the relationship between the highest-level cache and main-memory.

Finally, be aware that caching setups do vary, for example, AMD Athlon X2 CPUs have 64k, 2-Way set associative L1 caches. The more general rules-of-thumb mentioned above (spacial and temporal locality of reference) ought to give good results across all platforms -- I wouldn't attempt to optimize aggressively for one particular caching setup unless your platform is predictable (console, handheld, embedded system development), or you are optimizing towards a range of CPUs that have similar caching setups.

Share this post


Link to post
Share on other sites
Always optimize by choosing better algorithms, first. It's only when you KNOW you have the fastest possible algorithms in place, that you begin to worry about this stuff.

There are two stages to this process. One is done by touch and feel - put your data close together (aka, arrays for many things), and put the references to the data close together (avoid hopping all over the array where possible). Sort data by frequency of reference instead of other keys (experimentally: it doesn't always help).

Rethink memory allocation - try to allocate what you need up front, and avoid a lot of free/allocate cycles, as that's a cache-killer. Order stack variables in a function from most used to least used, but push large arrays to the bottom. Don't recurse when you can iterate. Arrange the data elements on structs and classes in decreasing size order (doubles and long long at the top, then int, then short, and so on) to get tight but aligned packing, in the hopes of fitting it all in cache. The previous poster mentioned segregating hot and cold data - this can be a huge win, but make very sure you know which is cold or you will make things much worse. Use unsigned char or short when you need to store small integers; don't use double if float will do. Fuss with inlining; you don't want to be constantly hopping from function to function, but if you overdo it you'll end up with functions that don't fit in cache, and hurt yourself. Read the assembler listings - optimizers DO make mistakes and sometimes rearranging code will do impressive shrinks in the number of instructions needed.

It's all small stuff, but it can add up to a measurable improvement. Measure after every change.

Sometimes you do counterintuitive things. I remember AMD recommending doing tricks with asynch reads to "warm the processor cache ahead of use"; it worked, but it was so ugly I wished it didn't.

If that doesn't get you what you need, the next stage is specialized tools for your environment, and it gets hard. Hope you never have to go there. Sometimes it's just cheaper to move up in hardware, if you have that choice.

Share this post


Link to post
Share on other sites
Quote:
Original post by LemonScented
fluid physics


Since this is one of the problems people have been solving for a very long time, it may make sense to look into an existing library. In general, linear algebra and related topics have been optimized to oblivion already.

Quote:
- Is there a profiling tool available to find out which parts of my code are the worst offenders for cache-misses? (I'm using Visual C++ 2008 - I mention this in case the IDE itself contains performance tools which I've not yet discovered)


AMD CodeAnalyst and Intel VTune.

Quote:
- I'm led to believe that caching works best when dealing with structures that are 2^n in size.


In general no. The unit of measure is cache line, 64 bytes. Alignment however does matter, which is why structures are often padded or designed to fit nicely into cache lines.

Quote:
- Does anyone have any "advanced tips" on the cache, or can anyone recommend a suitable resource on the subject?


I found this overview (PPT) to cover the problem domain well.

Share this post


Link to post
Share on other sites
If you know your access patterns then you should prefetch your data before using it.

E.g.

for(unint i = 0; i < abignumber; ++i) {
prefetch(myarray + CACHE_LINE_SIZE);

}

Share this post


Link to post
Share on other sites
1 - before doing anything, profile
2- after doing anything, profile
3- only do one thing at once
4- only fix things that show up in the profile.

so, you may not actually have a problem at this point, check your profile output first.
the profile will also show up exactly what type of cache problems you have (if you have any).

algorithmic changes will work far better than small instruction based changes.

data layout and access patterns will be the biggest target area for cache problems.

Matt D

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
  • Advertisement
  • Popular Tags

  • Similar Content

    • By ethancodes
      I'm working on a system for my game that will allow the player to stack pick ups in a queue. As one pick up expires, the next automatically activates. I'm having an issue though where if I pick up the first one, it activates fine, but if i pick up a second directly after it, it overrides the first one, activates the second one, and then once it has run it's course, everything goes back to normal gameplay, no first pick up. I'm not sure why this is happening. Hopefully someone can spot what I'm doing wrong in my code.
      Here is the code for the pick up manager:
      // Update is called once per frame void Update () { if (pickUpQueue.Count != 0 && !pickUpActive) { pickUpActive = true; pickUpQueue[0].ActivatePickUp(); } DeactivatePickUp(); } void DeactivatePickUp () { if (pickUpQueue.Count != 0 && pickUpActive) { Destroy (pickUpQueue [0]); pickUpQueue.RemoveAt (0); pickUpActive = false; } } And here is the PickUp:
      public override void ActivatePickUp () { ball.GetComponent<Ball>().Speed = 2.0f; //increase ball speed... ball.GetComponent<Ball>().StartCoroutine(timer); //...set time that power up is active }  
      There is also a Base Pick Up:
      public void OnCollisionEnter2D (Collision2D collision) { Vector2 tweak = new Vector2 (Random.Range(0f, 0.2f),Random.Range(0f, 0.2f)); this.gameObject.GetComponent<Rigidbody2D>().velocity += tweak; //if the pickup makes contact with the paddle or ball.... if (collision.gameObject.tag == "Paddle" || collision.gameObject.tag == "Ball") { GameObject.FindObjectOfType<GameManager>().GetComponent<PickUpManager>().pickUpQueue.Add(this); Destroy(gameObject); //...and finally destroy power up object } } As a side note, I am trying to find a solution to this that will work for all of my pickups. Some pickups are ammo based, some are timed. 
    • By D34DPOOL
      Edit Your Profile D34DPOOL 0 Threads 0 Updates 0 Messages Network Mod DB GameFront Sign Out Add jobEdit jobDeleteC# Programmer for a Unity FPS at Anywhere   Programmers located Anywhere.
      Posted by D34DPOOL on May 20th, 2018
      Hello, my name is Mason, and I've been working on a Quake style arena shooter about destroying boxes on and off for about a year now. I have a proof of concept with all of the basic features, but as an artist with little programming skill I've reached the end of my abilities as a programmer haha. I need someone to help fix bugs, optomize code, and to implent new features into the game. As a programmer you will have creative freedom to suggest new features and modes to add into the game if you choose to, I'm usually very open to suggestions :).
      What is required:
      Skill using C#
      Experience with Unity
      Experience using UNET (since it is a multiplayer game), or the effort and ability to learn it
      Compensation:
      Since the game currently has no funding, we can split whatever revenue the game makes in the future. However if you would perfer I can create 2D and/or 3D assets for whatever you need in return for your time and work.
      It's a very open and chill enviornment, where you'll have relative creative freedom. I hope you are interested in joining the team, and have a good day!
       
      To apply email me at mangemason@yahoo.com
    • By davejones
      Is there a way to automatically change the start position of an animation? I have a bunch of animations set up on 3D models in unity. The issue is that I need to move the 3D models, however when I do so the animation start positions are not updated and I have to do it manually.

      Changing the transform of key frames is time consuming with the amount of animations I have, so I was wondering if there was a way to do it automatically?
    • By MoreLion
      hey all! We are looking for members for our Unity horror game! 
      Here’s the story:
      After a deadly virus plunges the world into chaos killing 85% of the human population there are now what they call “zones” these zones are watched very closely by the surviving government, people are checked every day for the virus, even if you touch the spit or any human waste or fluids of the victim who is infected, you will die. But one day, people in the west zone start to go missing, 1 woman goes outside the walls to uncover the mystery, is there more to the virus than meets the eye?, That is where your story starts.
      This game is not a long development game, I have loads other game ideas,
      I will also allow you to have a bit of creative freedom if you wish to add or share a idea!
      And no, it’s not a zombie game lol I feel like zombie games are too generic, in this game you will encounter terrifying beasts!
      There is some concept art one of our concept artists have made
      If interested email liondude12@gmail.com
    • By Canadian Map Makers
      GOVERNOR is a modernized version of the highly popular series of “Caesar” games. Our small team has already developed maps, written specifications, acquired music and performed the historical research needed to create a good base for the programming part of the project.

      Our ultimate goal is to create a world class multi-level strategic city building game, but to start with we would like to create some of the simpler modules to demonstrate proof of concept and graphical elegance.

       

      We would like programmers and graphical artists to come onboard to (initially) create:

      A module where Province wide infrastructure can be built on an interactive 3D map of one of the ancient Roman Provinces.
      A module where city infrastructure can be built on a real 3D interactive landscape.
      For both parts, geographically and historically accurate base maps will be prepared by our team cartographer. Graphics development will be using Blender. The game engine will be Unity.

       

      More information, and examples of the work carried out so far can be found at http://playgovernor.com/ (most of the interesting content is under the Encyclopedia tab).

       

      This project represents a good opportunity for upcoming programmers and 3D modeling artists to develop something for their portfolios in a relatively short time span, working closely with one of Canada’s leading cartographers. There is also the possibility of being involved in this project to the point of a finished game and commercial success! Above all, this is a fun project to work on.

       

      Best regards,

      Steve Chapman (Canadian Map Makers)

       
    • By Scouting Ninja
      So I have hundreds of moving objects that need to check there speed. One of the reasons they need to check there speed is so they don't accelerate into oblivion, as more and more force is added to each object.
      At first I was just using the Unity vector3.magnitude. However this is actually very slow; when used hundreds of times.
      Next I tried the dot-product check:  vector3.dot(this.transform.foward, ShipBody.velocity) The performance boost was fantastic. However this only measures speed in the forward direction. Resulting in bouncing objects accelerating way past the allowed limit.
       
      I am hoping someone else knows a good way for me to check the speed with accuracy, that is fast on the CPU. Or just any magnitude calculations that I can test when I get home later.
       
      What if I used  vector3.dot(ShipBody.velocity.normalized, ShipBody.velocity)?
      How slow is it to normalize a vector, compared to asking it's magnitude?
    • By Ds ds
      Hi, my name is Andres, I'm a programmer with a technician degree and a Diploma in C#, looking for a project in Unity to start my career in game development. I don't do it for a paid but a recognition and start a portfolio, preferably a 2D game. Thanks for read, have a nice day. 
       
    • By Victor Rodriguez
      Hi there! Is the first time that I'm posting here so I'm sorry if I'm doing it wrong ha. 
      So here it comes, my doubt is, I'm doing a game with different levels, each of these levels in one different scene. Each scene contains to cameras that you can change pressing a button. Everything works fine. 
      The only problem is that I would like it to look a bit more professional, and I would like that if you finish the level with camera2, the next level start the same way. I've been thinking about using dontdestroyonloadon both cameras, but obviously this cameras need to be attached to the player to make the movement work, what do you recommend? Sorry If I've explained it in a messy way, and feel free to dm me for anything. Thanks in advance! 
    • By Ike aka Dk
      Hello everyone 
      I am a programmer from Baku.
      I need a 3D Modeller for my shooter project in unity.I have 2 years Unity exp.
      Project will paid when we finish the work 
      If you interested write me on email:
      mr.danilo911@gmail.com
    • By markoal
      Hi,
      I'm Unity developer from Croatia and I'm looking to work on the paid project in my spare time.
      I have 5+ years of experience in Unity and I'm familiar with almost anything, including all platforms (also Switch, PS4 and Xbox).
      Feel free to contact me.
  • Advertisement
  • Popular Now

  • Forum Statistics

    • Total Topics
      631367
    • Total Posts
      2999594
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!