How to actually measure stuff like cache hit, prefetch, ram traversal?

Started by
12 comments, last by WoopsASword 8 years, 2 months ago

Hi Guys,

This question is mostly out of pure interest.

Reading a lot of topics on this forum about game engine design and code, I often come across posts that say something like "this way you are more likely to load the next bit of data into the memory that the cpu will need, increasing cache coherency...", usually in topics about data oriented design (DOD).

From my understanding, DOD is something to do with stuff being in contiguous memory that the cpu cycles over each game loop, and it seems easier to achieve this using DOD rather than inheritance?

Secondly... how can you actually measure this stuff? I know in a very simple game engine it doesn't really matter, but I have a learning game written using both ECS and plain OOP that essentially does the same thing (a sprite moves around, shoots, and gets shot at by enemies). How can I compare which implementation is actually better, and which areas of each one are working well?

I hope my question makes sense.

Thanks

Intel and AMD both have some amazing tools available to help measure those things and read all the internal details of their respective processors.

AMD's is a free tool called CodeXL; Intel's is a mostly paid tool called VTune. VTune is potentially zero-cost for academics and open source use, and there are 30-day trials that some people rely on for short-term use.

Tools for other processors and chipsets may exist, depending on what you are looking for.
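For a coarse first comparison of two implementations, plain wall-clock timing is sometimes enough to tell whether there is a difference worth profiling. The sketch below is only that: `average_frame_time_us` is a made-up helper name, and it reports time per call, not cache hits or misses. Those come from the hardware counters that profilers read.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `update` `frames` times and returns the average
// wall-clock time per call in microseconds. This measures time only; it says
// nothing about why one version is slower (cache misses, branches, ...).
template <typename Update>
double average_frame_time_us(Update&& update, std::size_t frames) {
    assert(frames > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (std::size_t i = 0; i < frames; ++i) {
        update();
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(frames);
}
```

Calling it once with the ECS update and once with the OOP update, on the same scene and with optimizations enabled, gives two numbers to compare. Run each several times, since a single run can be noisy.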

I often come across posts that say something like "this way you are more likely to load the next bit of data into the memory that the cpu will need, increasing cache coherency...",


Mistrust any advice about DoD or performance from people who use the term "cache coherency" in that context, as they clearly have no idea what the words they're using mean. :P

From my understanding, DOD is something to do with stuff being in contiguous memory that the cpu cycles over each game loop, and it seems easier to achieve this using DOD rather than inheritance?


Sort of. Data-oriented design is, specifically, understanding what your data is and how it's operated upon.

If you know that you're going to be looping over your data frequently and don't have many other concerns then pick a data structure that makes iterating over that data cheap and efficient. That's it. DoD is not some specific magic wand or specific pattern. It's a way of thinking through problems. And not the only way, nor the most important. It's just one tool in your toolbox.
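As a rough illustration of "pick a data structure that makes iterating cheap", here is the same movement update over two layouts. All names (`Sprite`, `SpriteMotion`, `move_all`) are invented for this sketch and the field set is an assumption; which layout actually wins depends on the data and on what else touches it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array-of-structs layout: each sprite carries every field, so
// the movement loop also drags health and texture_id through the cache.
struct Sprite {
    float x, y;
    float vx, vy;
    int health;
    int texture_id;
};

void move_all(std::vector<Sprite>& sprites, float dt) {
    for (Sprite& s : sprites) {
        s.x += s.vx * dt;
        s.y += s.vy * dt;
    }
}

// Hypothetical struct-of-arrays layout: only the fields the movement loop
// needs, each stored in its own contiguous array.
struct SpriteMotion {
    std::vector<float> x, y, vx, vy;
};

void move_all(SpriteMotion& m, float dt) {
    const std::size_t n = m.x.size();
    assert(m.y.size() == n && m.vx.size() == n && m.vy.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        m.x[i] += m.vx[i] * dt;
        m.y[i] += m.vy[i] * dt;
    }
}
```

Both versions compute the same positions. The difference is only in how much unrelated data each loop has to pull in along the way.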

Sean Middleditch – Game Systems Engineer – Join my team!

Regarding your "how do you measure this stuff" question:

If this is purely out of interest, you will find your answer is platform dependent. A great starting point is the Intel Software Tuning, Performance Optimization, and Platform Monitoring forums, assuming you are on an Intel platform. If you spend some time there, you'll find the work that goes into measuring these variables pretty amazing. It is a great place to ask these kinds of questions, and also the "how do I fix it" kind.


VTune is potentially zero-cost for academics

Decided to try because apparently it is free for students. Got BSOD the moment I clicked "Start Analysis". Guess I'm not using it ever again.


Mistrust any advice about DoD or performance who use the term cache coherency in that context, ...

That's a good point. I'm sure I've made that mistake myself, and I know I've seen it made numerous times in articles and forum threads alike. It seems like a term that might mean what they meant, but it's not. I guess something like cache locality is more apt, or perhaps temporal locality more generally.

To me, data-oriented-design boils down to two things -- firstly, the understanding that memory accesses (specifically, cache-line loads and stores to main memory), not CPU cycles, are the true bottleneck in most heavy computations, and secondly, following from the first, that the shape and flow of performance-critical data should be the prime concern, even above algorithmic, micro-optimization, and dogmatic object-oriented design considerations. What DOD is NOT, is recasting every inconsequential corner of your application to be "cache friendly".
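To put a number on the "cache-line loads" point, here is a back-of-the-envelope sketch that counts how many distinct cache lines a loop touches when it reads one field from each element. The function name is made up, the 64-byte line size is an assumption (common on current x86 parts, but not universal), and the array is assumed to start on a line boundary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts the distinct cache lines touched when reading
// `field_size` bytes from the start of each of `count` elements laid out
// `stride` bytes apart. Assumes the first element is line-aligned.
std::size_t cache_lines_touched(std::size_t count, std::size_t stride,
                                std::size_t field_size,
                                std::size_t line_size = 64) {
    assert(field_size > 0 && line_size > 0 && stride >= field_size);
    std::size_t lines = 0;
    std::size_t next_uncounted = 0;  // index of the first line not yet counted
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t first = (i * stride) / line_size;
        const std::size_t last = (i * stride + field_size - 1) / line_size;
        if (first < next_uncounted) first = next_uncounted;
        if (last >= first) {
            lines += last - first + 1;
            next_uncounted = last + 1;
        }
    }
    return lines;
}
```

Under those assumptions, reading a 4-byte float from each of 1024 elements touches 64 lines when the floats are packed (stride 4) and 1024 lines when each float sits inside a 64-byte object (stride 64): the same arithmetic work, sixteen times the memory traffic.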

DoD is often cast in opposition to classical OOP techniques because OOP places a lot of emphasis on human-centered object modelling, and it's true that you have to cut against the grain of much OOP programming advice, but OOP is still a valuable tool in implementing DoD architecture, and in the rest of your program.

throw table_exception("(? ???)? ? ???");


That's a good point. I'm sure I've made that mistake myself, and I know I've seen it made numerous times in articles and forum threads alike. It seems like a term that might mean what they meant, but it's not. I guess something like cache locality is more apt, or perhaps temporal locality more generally.

I think I've heard people use the term "coherent accesses"; maybe it stems from that?

-potential energy is easily made kinetic-


That's a good point. I'm sure I've made that mistake myself, and I know I've seen it made numerous times in articles and forum threads alike. It seems like a term that might mean what they meant, but it's not. I guess something like cache locality is more apt, or perhaps temporal locality more generally.

I think I've heard people use the term "coherent accesses"; maybe it stems from that?

That is a related topic, yes.

Understanding exactly how memory works in modern computers is a fairly big topic.

Decades ago it was easy enough: You had main memory and you had a processor. There was no cache. If you needed something it was fetched from memory.

Then they added a cache. Early caches were a few bytes. Then 16 bytes, then 32 bytes, then bigger and bigger.

Then there were more levels of caches. You could buy dedicated external cache that logically sat between your real main memory and your CPU.

Then it became popular to add a second CPU. And more cache chips. Cache had both CPU-integrated and external chips.

Then each CPU gained additional levels of cache that needed to be kept coherent.

These days you have multiple levels of cache feeding into potentially multiple physical processors that all have their own caches, feeding into potentially multiple virtual processors that also potentially have caches. Then inside the processor all the instructions are broken down and reordered anyway, and the CPU will predict where your memory accesses will be so it can prefetch them before the instructions are fully decoded. Modern hardware does lots of amazing things.

Any time you modify something somewhere, all the other caches that know about the value need to be updated to hold the right value. If you update processor 3's data cache for a memory address, the change needs to eventually go out to any other processors and caches that also hold that memory address.
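One place this shows up in practice is when two threads write to different variables that happen to share a cache line, so every write by one forces the other's copy of the line to be updated or invalidated (usually called false sharing). A minimal layout sketch, assuming 64-byte lines; the struct and function names are invented here, and no threads are started:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Two per-thread counters packed together: they will normally share a line.
struct PackedCounters {
    long a;
    long b;
};

// Same counters, each forced onto its own line via alignas.
struct PaddedCounters {
    alignas(kCacheLine) long a;
    alignas(kCacheLine) long b;
};

// True if both addresses fall into the same kCacheLine-sized block.
inline bool same_cache_line(const void* p, const void* q) {
    assert(p != nullptr && q != nullptr);
    const auto lp = reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    const auto lq = reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
    return lp == lq;
}
```

With the padded layout, a thread hammering `a` and another hammering `b` no longer contend for the same line, at the cost of 128 bytes for two counters. Whether that trade is worth it is something to measure, not assume.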

Good data-oriented design means understanding where all the copies of the object are, trying to minimize copies so that ideally there is a single chain directly from main memory rather than copies spread to every processor, and trying to ensure data is always available in the correct cache when you need it. In addition to a solid software development background and an understanding of data structures and algorithms, that also means a good understanding of physical hardware and the hardware configurations that seem to change a little with every new hardware generation.

From what I've heard so far, getting data from a "foreign" cache takes as long as going to main memory. I used to think the MOESI protocol would accelerate such accesses, but I guess not.

-potential energy is easily made kinetic-


That's a good point. I'm sure I've made that mistake myself, and I know I've seen it made numerous times in articles and forum threads alike. It seems like a term that might mean what they meant, but it's not. I guess something like cache locality is more apt, or perhaps temporal locality more generally.

Spatial locality, actually. Which I think goes a long way towards proving how complicated the whole subject really is to most programmers. :)
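For anyone unsure what spatial locality looks like in code, a small sketch: the same grid summed in two traversal orders. The function names and the flat row-by-row storage are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The grid is stored row by row in one contiguous buffer, so cell (r, c)
// lives at index r * cols + c.

// Visits neighbouring addresses one after another: good spatial locality.
long long sum_row_major(const std::vector<int>& grid,
                        std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += grid[r * cols + c];
    return total;
}

// Jumps `cols` elements between consecutive reads: same result, but each
// read may land on a different cache line once the rows are wide enough.
long long sum_column_major(const std::vector<int>& grid,
                           std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += grid[r * cols + c];
    return total;
}
```

The two functions always return the same sum. On a grid much larger than the cache, the row-major version is typically the faster one, but how much faster depends on the hardware, so time it.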
