Cache misses and VTune

5 comments, last by WoopsASword 8 years, 2 months ago

Hello. I've started using VTune and have been reading about cache-friendly data layouts. Now I'm trying to understand how VTune works by writing some simple programs that deliberately violate those cache-friendliness rules.

For example:

 

#include <cstdlib>   // rand(), system()
#include <tchar.h>   // _tmain / wchar_t entry point (MSVC)

int _tmain(int argc, wchar_t* argv[])
{
    const int numbers = 500000;

    // 532 bytes per object, so each array below is about 254 MB.
    struct object {
        float matrix[16];
        float matrix2[16];
        float matrix3[16];
        float matrix4[16];
        float matrix5[16];
        float matrix6[16];
        float matrix7[16];
        float matrix8[16];
        float position[3];
        float uv[2];
    };

    object* vars  = new object[numbers];
    object* vars2 = new object[numbers];
    object* vars3 = new object[numbers];
    object* vars4 = new object[numbers];
    object* vars5 = new object[numbers];

    // Each iteration reads one random field of a random object
    // from each of the five arrays.
    float c;
    for (int i = 0; i < numbers; i++) {
        switch (rand() % 10) {
        case 0:
            c = vars [rand() % numbers].position[rand() % 3];
            c = vars2[rand() % numbers].position[rand() % 3];
            c = vars3[rand() % numbers].position[rand() % 3];
            c = vars4[rand() % numbers].position[rand() % 3];
            c = vars5[rand() % numbers].position[rand() % 3];
            break;
        case 1:
            c = vars [rand() % numbers].matrix[rand() % 16];
            c = vars2[rand() % numbers].matrix[rand() % 16];
            c = vars3[rand() % numbers].matrix[rand() % 16];
            c = vars4[rand() % numbers].matrix[rand() % 16];
            c = vars5[rand() % numbers].matrix[rand() % 16];
            break;
        case 2:
            c = vars [rand() % numbers].matrix2[rand() % 16];
            c = vars2[rand() % numbers].matrix2[rand() % 16];
            c = vars3[rand() % numbers].matrix2[rand() % 16];
            c = vars4[rand() % numbers].matrix2[rand() % 16];
            c = vars5[rand() % numbers].matrix2[rand() % 16];
            break;
        case 3:
            c = vars [rand() % numbers].matrix3[rand() % 16];
            c = vars2[rand() % numbers].matrix3[rand() % 16];
            c = vars3[rand() % numbers].matrix3[rand() % 16];
            c = vars4[rand() % numbers].matrix3[rand() % 16];
            c = vars5[rand() % numbers].matrix3[rand() % 16];
            break;
        case 4:
            c = vars [rand() % numbers].matrix4[rand() % 16];
            c = vars2[rand() % numbers].matrix4[rand() % 16];
            c = vars3[rand() % numbers].matrix4[rand() % 16];
            c = vars4[rand() % numbers].matrix4[rand() % 16];
            c = vars5[rand() % numbers].matrix4[rand() % 16];
            break;
        case 5:
            c = vars [rand() % numbers].matrix5[rand() % 16];
            c = vars2[rand() % numbers].matrix5[rand() % 16];
            c = vars3[rand() % numbers].matrix5[rand() % 16];
            c = vars4[rand() % numbers].matrix5[rand() % 16];
            c = vars5[rand() % numbers].matrix5[rand() % 16];
            break;
        case 6:
            c = vars [rand() % numbers].matrix6[rand() % 16];
            c = vars2[rand() % numbers].matrix6[rand() % 16];
            c = vars3[rand() % numbers].matrix6[rand() % 16];
            c = vars4[rand() % numbers].matrix6[rand() % 16];
            c = vars5[rand() % numbers].matrix6[rand() % 16];
            break;
        case 7:
            c = vars [rand() % numbers].matrix7[rand() % 16];
            c = vars2[rand() % numbers].matrix7[rand() % 16];
            c = vars3[rand() % numbers].matrix7[rand() % 16];
            c = vars4[rand() % numbers].matrix7[rand() % 16];
            c = vars5[rand() % numbers].matrix7[rand() % 16];
            break;
        case 8:
            c = vars [rand() % numbers].matrix8[rand() % 16];
            c = vars2[rand() % numbers].matrix8[rand() % 16];
            c = vars3[rand() % numbers].matrix8[rand() % 16];
            c = vars4[rand() % numbers].matrix8[rand() % 16];
            c = vars5[rand() % numbers].matrix8[rand() % 16];
            break;
        case 9:
            c = vars [rand() % numbers].uv[rand() % 2];
            c = vars2[rand() % numbers].uv[rand() % 2];
            c = vars3[rand() % numbers].uv[rand() % 2];
            c = vars4[rand() % numbers].uv[rand() % 2];
            c = vars5[rand() % numbers].uv[rand() % 2];
            break;
        }
    }

    system("PAUSE");
    return 0;
}
 

I ran an analysis, and the cache numbers seem weird, or maybe I just don't understand them.

I've attached an image where L1 Bound and L2 Bound are 0 and L3 Bound is 0.017, even though the program seems to touch a large enough amount of data in a random enough pattern.

Where should I start looking, especially if I need to optimize another, more complex program?

Thanks

[attachment=30585:vtune.jpg]

Ugh.

1. Are you compiling this in Release? Profiling debug builds is pointless.
2. Visual C++ will remove all of that code, because it has no observable effect. VC++ isn't that stupid.
3. Don't profile for loops.
4. Don't profile for loops.
5. Don't profile for loops.

Why not profile for loops? If I access the first element of the data, shouldn't the cache line be loaded with the data next to it, so that the next element is a cache hit? And why would Visual Studio change my code? Aren't the random numbers generated at runtime?

You aren't doing anything with 'c'. So VC++ will strip all the code that was used to generate that value. That leaves you with an empty switch, which it will strip. That leaves you with an empty for loop, which it will also strip. Since there's now no point in your object arrays, it will also strip those. Your program now consists of a single system call. :/

As I said, don't profile for loops, because even a poor compiler will strip them out. Profile applications instead.
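If you do want a toy loop like that to survive the optimizer, the usual trick is to make its result observable. A minimal sketch of that idea (hypothetical code, not from the post above):

#include <cstdio>
#include <cstdlib>

int main()
{
    const int numbers = 500000;
    float* data = new float[numbers]();   // value-initialized to zero

    float sum = 0.0f;
    for (int i = 0; i < numbers; i++)
        sum += data[rand() % numbers];    // random reads, as in the original test

    printf("%f\n", sum);                  // the result is used, so the loop
                                          // cannot be discarded as dead code
    delete[] data;
    return 0;
}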

I'm sorry, but the data you've shown is exactly what's supposed to happen.

You're completely random-accessing and looping through roughly 1.3 GB of data (five arrays of about 254 MB each), which obviously does not fit in the cache, and VTune is telling you that you're DRAM bound. This is exactly what happens when one iteration indexes element 5 of one array and element 26,600,000 of another, and the next iteration indexes elements 99,990 and 7,898: the cache is effectively useless, and all the bottlenecks will be in the DRAM.

What do you expect it to tell you?
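For contrast, a rough sketch of the two access patterns involved here, assuming a plain array of floats (hypothetical helper functions, not part of the original program): a linear sweep keeps hitting data that is already in the cache, while random indexing almost never does.

#include <cstdlib>

// Sequential sweep: consecutive elements share cache lines and the
// hardware prefetcher can stay ahead, so most accesses hit the cache.
float sweep(const float* data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

// Random access over a huge array: each load likely lands on a cold
// cache line, so the profile shows up as DRAM bound rather than
// L1/L2/L3 bound.
float scatter(const float* data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += data[rand() % n];
    return sum;
}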

Indeed, you are not properly understanding the analysis - "Memory Bound" refers to time where the CPU can't do anything because it's waiting for data to come from memory. So for example "L1 bound" means the CPU is stalling for data that is in the L1 cache. If the data isn't even in the L1 cache to begin with, the CPU can't stall on the L1 cache.

You've written a program that is deliberately cache-unfriendly - it should come as no surprise that it is not significantly cache-bound, because the data being requested is almost never in the cache to begin with. This is why the DRAM Bound number is so high.

Remember, no metric is really useful by itself - you have to look at the bigger picture (in this case, all the memory bound statistics, not just the cache bound ones).

Cache misses generally come from two things:

A) Memory allocation - allocations that aren't sized or laid out with the CPU cache in mind.

The first thing to know is how big your cache line is. It's probably around 64 bytes.

How many cache lines fit in each cache level depends on the CPU.
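If you want the figure for your own machine, C++17 exposes an implementation-defined approximation of the cache-line size (assuming your standard library implements it); a tiny sketch:

#include <iostream>
#include <new>

int main()
{
    // Implementation-defined approximation of the cache line size;
    // typically prints 64 on x86-64.
    std::cout << std::hardware_destructive_interference_size << " bytes\n";
    return 0;
}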

Because cache capacity is limited, very large allocations tend to cause more cache misses.

However, that is not the main cause. The main cause of cache misses is B:

B) Memory access patterns -

How do you access your memory? Is it cache-friendly or not?

For example, think about accessing an array of ints (or, even better, a matrix!).

If we load a 16x16 matrix of int32, that's great: the whole thing is only 16 * 16 * 4 = 1 KB, so it fits entirely in the L1 cache and there are essentially no misses.

However, with a bigger matrix, the access pattern becomes much more important.

If the matrix is stored row by row ((x1,y1), (x2,y1), (x3,y1), ...) and you step through it as (x+1, y), that's good: the next element is usually in the cache line you have already loaded.

However, if you step through it as (x, y+1), jumping a whole row on every access, you'll get cache misses because each access has to fetch a different cache line.
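A minimal sketch of those two traversal orders, assuming a plain row-major 2D array (hypothetical code, just for illustration):

// Row-major array: elements of a row are contiguous in memory.
static float grid[1024][1024];

// Cache-friendly: walks memory in order, one cache line at a time.
float sumRowMajor()
{
    float sum = 0.0f;
    for (int y = 0; y < 1024; y++)
        for (int x = 0; x < 1024; x++)
            sum += grid[y][x];
    return sum;
}

// Cache-unfriendly: consecutive accesses are 4 KB apart, so each one
// touches a different cache line (and often a different page).
float sumColumnMajor()
{
    float sum = 0.0f;
    for (int x = 0; x < 1024; x++)
        for (int y = 0; y < 1024; y++)
            sum += grid[y][x];
    return sum;
}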

I hope this little introduction helped you. There are tons of guides on Google; please check them out.

If you want to investigate cache misses and memory misuse, that's the place to start looking :)

This topic is closed to new replies.
