Rudibs

Cache performance


I would like to code a "profiler" class that calculates the cache hit rate, cache hits, and cache misses. Is this possible to do, and if anyone knows, can they give me a direction on where to start?

I don't believe it is possible. I'm pretty sure cache hits/misses are handled at the processor level, and even the OS is oblivious to what's going on down in cache land.

There may be ways to do it, but there are none that I know of that aren't terribly processor-specific.

There's another post here that discusses cache optimization (particularly how it's pretty much irrelevant in a multitasking environment) which may be of some interest to you. And to be honest, I agree - cache optimization isn't something to worry about at all. Back in the DOS days they did worry about cache misses... but DOS wasn't a multitasking environment, and RAM access constituted up to a 90 ns pause. Nowadays that number is down to about 6 ns, not to mention the benefits of DDR (which doubles the access rate) and DDR2 (which doubles it again).

AMD's CodeAnalyst profiler can do that. It all depends on the hardware itself: you have to find the exact set of assembly instructions for each processor that let you read the hardware counters. So CodeAnalyst is probably only able to provide that information on an AMD processor.

Microsoft's PIX can also report all of that information for the Xbox, and Intel probably has a profiler that works on Intel chips.

Quote:

There's another post here that discusses cache optimization (particularly how it's pretty much irrelevant in a multitasking environment)

Even after reading the other post, I'm going to have to disagree here. Optimizing your code for the cache will give you big gains even in a multitasking environment: being cache friendly gets you more work done before your task is switched away.
_Sigma's post at the end of the page links to genuinely good details on how drastic the effects can be.

Valgrind is another toolbox that can analyse cache behavior (its cachegrind tool simulates the caches). It is open source, so you should be able to peek inside and have a look at how they did it. I have never looked at the internals, but I used it to analyse the access patterns of some code of mine; during the analysis it _really_ slows down the application, as every memory access is inspected.

I have to disagree with cache optimizations being unimportant. As always, you shouldn't bother as long as it's not a bottleneck, but I once managed to speed up some code by 30% just by placing prefetch calls.

Quote:
I don't believe it is possible.

Sure it is - every profiler does it.

AMD's developer pages contain tons of examples on how to use CodeAnalyst, and examples of how to access this data programmatically.

CPUs usually support probes which can be installed to report various events, many of them dedicated to the cache. But these are very CPU model/make specific, so they aren't something obtained trivially.

The best bet here is an existing profiler.


Quote:
Original post by zilchonum
cache optimization isn't something to worry about at all.

No, cache optimization is the only optimization that matters today. Everything today is IO bound - including CPUs.

Quote:
Back in the DOS days they did worry about cache misses...

Back in the DOS days memory access latency was one cycle and the memory:CPU clock ratio was 1:1. CPUs did not have caches or pipelines at all, so nobody worried about caches. There were enough problems already with memory segments and overlay juggling.

Things started changing with the 486, but the fundamental turn happened with the Pentium, when pipelines got absurdly deep and multi-level caches became the only way to improve performance.

Quote:
but DOS wasn't a multitasking environment. And RAM access constituted up to a 90 ns pause. Nowadays that number is down to about 6 ns, not to mention the benefits of DDR (which doubles the access rate) and DDR2 (which doubles it again).

Latency today is around 100-200 CPU cycles.

It may not sound like much - but it means several hundred mathematical operations that cannot be performed because the CPU is stalled on a memory access.

In practice, the difference between a cache-friendly and a cache-hostile IO-bound algorithm is on the order of 8-10x.

For graphics, this means the difference between 8 and 60 FPS.
For web and database apps, caches (the DB cache, not the CPU's) mean the difference between 50 and 50,000 concurrent users.

Thanks for the feedback, I appreciate it.

I did know some things about how the cache works, but I easily forget. I remember how to code it, but sometimes I get confused because I am not sure exactly how to avoid a cache miss.

Actually, I am coding only on the CPU; I don't do GPU stuff (I did in the past, but I thought it was boring). I like to experiment with coding graphics on the CPU only. I have a few ideas in mind, and I would like to code in bigger resolutions than 320x :p, so that's where the cache comes in handy. I was coding a raytracer, and I experimented a lot and got it to run quite fast, but I still want to be sure I am not doing something that may cause more cache misses. So, rather than posting a lot of raytracer source code, I'll try to ask through a simple example:

Like when I make a struct like this:

typedef struct
{
    char data1;
    char data2;
    short data3;
    short data4;
    short data5;
} cell; // a total of 8 bytes in this cell

So when I then use this in a lookup table, I would like to make it as easy as possible for me to optimize.

This is some cache information I got from a benchmark I ran:

Cache Information
Integrated Data Cache : 32kB, Synchronous, Write-Thru, 8-way, 64 byte line size
Integrated Instruction Cache : 32kB, Synchronous, Write-Back, 8-way, 64 byte line size
L2 On-board Cache : 1MB, ECC, Synchronous, ATC, 4-way, 64 byte line size, 2 threads sharing
L2 Cache Multiplier : 1x
I also read that I have an L3 cache, so that means I have three cache levels for the CPU, right?
I mostly want to optimize for the L1 cache since that is the fastest, but if I can optimize for the L2 cache at the same time, that is even better.

So when I look up this memory cell... how big should the cell be? 64 bytes, since the line size is 64? Or does the L1 cache have some fixed size for each memory address? When I fetch a memory cell by using a structure it should be aligned, but again, how big should the struct be?

A 32 kB data cache means a total of 32 kB for all the memory addresses; if the line size is 64, that means there are 32 kB / 64 = 512 lines that can be fetched? And I should try to align them so that if I fetch, say, some texture coordinates, it should be 512xHEIGHT so that the alignment comes out right? (Silly example, but anyway.)

Maybe this sounds stupid, but I get really confused sometimes because I don't really know that much about CPU hardware and how it works. Sometimes I only think it works a certain way, and it may not be true because of that. I also hate reading through articles, because there's so much other stuff that bores me and then I don't bother reading any further.

Quote:
Original post by Rudibs

typedef struct
{
    char data1;
    char data2;
    short data3;
    short data4;
    short data5;
} cell; // a total of 8 bytes in this cell

As a rule, members are always ordered largest to smallest.

Reading up on structure padding and alignment is a good start here.

Quote:
So when I look up this memory cell... how big should the cell be? 64 bytes, since the line size is 64? Or does the L1 cache have some fixed size for each memory address? When I fetch a memory cell by using a structure it should be aligned, but again, how big should the struct be?

Coding for a specific cache layout is impractical, and each CPU version applies its own heuristics for what to fetch and when. Then there are threads, cores, and whatnot.

One approach is cache-oblivious algorithms. IIRC, it was once proven that an algorithm which is optimal for a two-tier cache hierarchy is optimal across any number of tiers. But it's a fairly complicated topic.

A simpler way to approach it: the memory chunk the code is currently working on should be as busy as possible - there should be as few unused fields as possible. The second guideline is to prefer sequential processing. Sometimes two passes may be better than one.


Quote:
maybe this sounds stupid but i get really confused sometimes because i dont really know that much about cpu hardware and how it works, i sometimes only think it works that way, but it may not be true because of that. and also i hate reading through articles because there's so much other stuff that makes me bored and i dont bother reading any further.

Well, tough luck then. Because the minimum required reading, excluding CPU manuals, is:
POOP (pdf)
alignment
Memory optimization (pdf)
Followed by cache locality and related topics, CPU cache, and perhaps the TLB.
