Posted 08 May 2014 - 09:54 AM
Posted 08 May 2014 - 12:16 PM
Check out my book, Game Development with Unity, aimed at beginners who want to build fun games fast.
Also check out my personal website at bryanwagstaff.com, where I occasionally write about assorted stuff.
Posted 08 May 2014 - 08:08 PM
Posted 08 May 2014 - 10:46 PM
I haven't used VS's profiler, but the more advanced ones out there will definitely let you configure what statistics/measurements you see -- such as cache misses incurred by each function.
You can get a long way just using theory - look up Cache-oblivious algorithms.
Step through your code and see what the memory access patterns are like. Have you taken cache behaviour into account when designing the code so far? Can it be redesigned so that the memory access patterns are either more linear or have much more locality? Do you run a single algorithm at a time, progressing in stages, or do you jump around a lot with things like calling virtual void OnHit() in the middle of collision resolution?
Posted 08 May 2014 - 11:43 PM
Can I look at cache hits or misses?
CodeXL has support for cache line utilization if you run that type of profile session. Make sure you have the current version; the docs say the feature was added last November. That is why I mentioned both CodeXL and CodeAnalyst: CodeXL is not mature in some ways, so having both is a good option.
Profilers give you numbers, not recommendations. For cache use, they will give you a percentage called "cache line utilization", along with figures like "Total L1 eviction", "Bytes/L1 Eviction", and "Accesses/L1 Eviction".
But that brings us to the other issue...
the only thing I could make from it was "here's where your method is spending the most time" which stops being helpful. ... [Cache is] pretty much the only other thing I'm aware I should be optimising for.
Optimizing is an art and science. It usually requires a lot of experience, plus a broad knowledge base about programming generally. You need to take profiler results, a bunch of numbers that give you statistics about performance, and turn them into actionable items. It is rather hard, even for people who do it regularly.
ALL NUMBERS IN THIS THOUGHT EXPERIMENT ARE MADE UP, BUT APPROXIMATELY CORRECT.
Sampling profilers just pause the program every few microseconds, look where it is, and unpause it. That means numbers are not exact. It also means you need to reason through the system very carefully. Often you will need to get several profile runs where the system is approximately the same, but it will still be slightly different every run, and you'll need to think in terms of generalities, approximations, averages, and Fermi estimates.
So you first run a sampling profile for one second during the middle of the game. You see a bunch of sample numbers. main() is taking 99.93% of the time. That is good, probably: it means that most of the time the profiler stopped the program, it was inside your program and not off doing some other task. RunPhysics was sampled 44 times, DrawBoard was sampled 42 times, and Sprite::Draw was sampled 1536 times. Is that good? Or bad? Matrix44::Multiply was sampled 52134 times and is taking 47% of the time. Is that bad?
The problem is that the numbers might be good or bad, and it takes a deeper understanding of the system to tell which.
So with RunPhysics sampled 44 times and DrawBoard 42 times, you probably had 45 or so frames. So you think about the frames. Is approximately 1536 calls to Sprite::Draw a reasonable number? Is that about how many times you would expect it to be called, or is it being called too many times? Perhaps you know you have 32 sprites on the screen, and 1536 / 45 = 34, so it is approximately right. Even if the number matches the frame count, you need to ask the deeper question of whether that is reasonable for the hardware. Does it mean 1536 texture swaps? Is that reasonable for the hardware? Does it mean 1536 calls passing raw information over the bus, or 1536 calls using data already stored on the card?
You look at the number of matrix multiply operations. The 52134 seems pretty high, it means roughly 1200 matrix multiplies per frame. For only 32 sprites, that is probably a bit high. So you need to hunt down the call stacks to see if there is some unexpected work going on. Is it reasonable for RunPhysics to sample 1142 matrix multiplies per frame?
Is the length of time spent in MatrixMultiply reasonable? You can see that the 52134 samples take 79472 CPU clocks. Is that good? Is that bad?
The profiler enables you to ask questions about your code. Interpreting the answers gets very tricky.
So let's say you think your MatrixMultiply is taking too much time. You then need to figure out an alternate solution. So you do some searching on Google and find a better algorithm written with SIMD intrinsics.
You put the changes in to the program, run the program with the profiler, get to about the same spot in code, run the profiler for about a second, and you get a totally different set of performance numbers.
This time you do the numbers and figure you got approximately 55 frames. MatrixMultiply had 64129 samples and only took 48726 CPU clocks. Well, that is certainly an improvement; it is roughly half the time it used to be. Is that good enough? Can you do better? Since it is an improvement you might as well submit the change to version control. You look it over, figure you don't see how to improve the algorithm you found online, and move on to another item.
Your next oddity is the number of matrix multiplies in RunPhysics from the first run. When you divided it by 45 frames you estimated about 1142 matrix multiplies per frame. This time dividing the total by 55 frames it looks like 1092 matrix multiplies per frame. Again, is this reasonable? You think it is a bit high so you review the algorithms in your physics. After a bit of study you see something that is suspicious. For this thought exercise, we'll say it is a loop where every physics object is tested against every other physics object for collisions. You think that is a bit odd. That's an N-squared algorithm, so each new physics object greatly increases the amount of work.
You think some more. What are your alternatives? An all-to-all test is pretty bad for large numbers, but in this case you have 32 physics objects. That means about 1024 comparisons. Is that really too much? Your game will never go bigger than that. There are several ways to reduce the number. You could add a spatial tree to only check nearby objects for collisions, but that means writing a spatial tree, keeping it updated on every physics change, and using it in various places. You could limit the tests to only objects marked as having moved, but that requires adding some flags and some extra processing elsewhere.
Continuing the thought, you also have the option to do nothing. Having 1024 tests really isn't that high a number. The cost of adding and maintaining a spatial tree is pretty high, plus you'll need to spend the time to debug it, so that is probably out. Keeping a list of "dirty" objects, objects that have moved and require collision tests, requires more tests than the spatial tree would but fewer than the all-to-all test. Since objects are frequently moving in your game, you figure 15 to 20 tests will normally be required every frame, so it probably isn't worth the effort right now. You still make a note about it: "// TODO: Possibly add a spatial tree rather than all-to-all comparison." You also make a note on your trusty paper pad, so the code comment doesn't get forgotten in oblivion.
Continuing on since you mentioned cache specifically, you run a new profile sample for cache usage. You find a function that has a high number of L1 evictions. One of your search routines had about 12000 L1 evictions over the sample. So you start asking the questions again. Is that actually bad? Assuming it is actually bad, what techniques do you know that can help fix it?
You need to decide if the results can be improved, estimate how much you can improve them and how long it will take, and decide if the effort is worth it. It means you need to go mentally through your treasure trove of experience. What techniques exist that are cache friendly? Why are they cache friendly? How can we reduce the work? How can we eliminate the work entirely? Might those specific techniques work in this situation? You can also search online, looking for algorithms that match.
And THAT paragraph is the critical one for optimization.
What experience do you have in alternative solutions when the first one doesn't work? Some algorithms you can just plug into a search engine. Let Google find you a well-written matrix multiply; it is such a common algorithm that it is well documented, and there are some really cool implementations out there. But you can only rely on the Internet for a small number of algorithms. The job of programmers is to come up with algorithms to solve problems. Your game is unique, your implementation is unique, and there are many unique challenges that nobody else has ever optimized, or if they have, they probably didn't blog about it.
The crux of optimization is that you MUST have enough experience to know alternatives. You also MUST have a knowledge of how all the parts work.
Only having one isn't enough. You may know something is slow, but if you don't have the experience to form an alternative solution, you are stuck with the slow existing solution. Or, if you do know some alternative algorithms but don't know how the parts work or why they have the performance characteristics they do, you are left guessing at each of the alternatives, blindly trying them until one hopefully turns out better.
If you have both experience and broad knowledge, you can use them in optimization. Let's say you face a specific sorting problem and have a lot of background knowledge: you can recall about 20 different sorting algorithms from your studies. Bubble sort is potentially an incredibly slow algorithm, with a large number of potential passes. But with a bit of knowledge, such as knowing that your data set is probably completely sorted, and if not, nothing is off by more than a single swap, a slightly modified bubble sort could be your ideal algorithm. You also know that a single linear pass through the data is very cache friendly. You can take both the experience and the knowledge and construct a single-pass algorithm based on bubble sort; in your specific case, thanks to the pre-sorted collection, you can guarantee that a single pass is enough. You can then convert a potentially horrible algorithm into an amazing solution for a specific problem.
But in order to do that, you need both knowledge and experience. With both, you can wrap your brain around problems, figure out alternatives, estimate the costs and benefits of the alternatives, test them out. If they work, submit them, if they don't work, try again.
That is the process. It is also why profiling and optimizing code is a difficult task, usually left to the more experienced developers. There are almost always a few easily identified and easily corrected problems. It can be obvious that when you have millions of calls to strlen() something is wrong, and you can hunt down why. You can dabble in optimization and try to learn the steps. But once the easy stuff is gone, doing it well requires some fairly rare skill sets: broad experience, broad knowledge, patience, and lots of thinking.