
Cache misses and VTune

This topic is 676 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hello. I've started using VTune and have been reading about cache-coherent data. Now I'm trying to understand how VTune works by writing some simple programs that deliberately violate the rules for cache-coherent data.

 

For example:

 

 

#include <cstdlib>  // rand, system
#include <tchar.h>  // _tmain (Visual C++ specific)

// 8 matrices + position + uv = 133 floats (532 bytes with 4-byte floats).
struct object {
    float matrix[16];
    float matrix2[16];
    float matrix3[16];
    float matrix4[16];
    float matrix5[16];
    float matrix6[16];
    float matrix7[16];
    float matrix8[16];

    float position[3];
    float uv[2];
};

int _tmain(int argc, wchar_t* argv[])
{
    const int numbers = 500000;

    // Five separate arrays; () zero-initializes them so the reads below are defined.
    object *vars  = new object[numbers]();
    object *vars2 = new object[numbers]();
    object *vars3 = new object[numbers]();
    object *vars4 = new object[numbers]();
    object *vars5 = new object[numbers]();

    float c = 0.0f;
    for (int i = 0; i < numbers; i++) {
        // Pick one member at random, then read a random element of it
        // from a random object in each of the five arrays.
        switch (rand() % 10) {
        case 0:
            c = vars[rand() % numbers].position[rand() % 3];
            c = vars2[rand() % numbers].position[rand() % 3];
            c = vars3[rand() % numbers].position[rand() % 3];
            c = vars4[rand() % numbers].position[rand() % 3];
            c = vars5[rand() % numbers].position[rand() % 3];
            break;

        case 1:
            c = vars[rand() % numbers].matrix[rand() % 16];
            c = vars2[rand() % numbers].matrix[rand() % 16];
            c = vars3[rand() % numbers].matrix[rand() % 16];
            c = vars4[rand() % numbers].matrix[rand() % 16];
            c = vars5[rand() % numbers].matrix[rand() % 16];
            break;

        case 2:
            c = vars[rand() % numbers].matrix2[rand() % 16];
            c = vars2[rand() % numbers].matrix2[rand() % 16];
            c = vars3[rand() % numbers].matrix2[rand() % 16];
            c = vars4[rand() % numbers].matrix2[rand() % 16];
            c = vars5[rand() % numbers].matrix2[rand() % 16];
            break;

        case 3:
            c = vars[rand() % numbers].matrix3[rand() % 16];
            c = vars2[rand() % numbers].matrix3[rand() % 16];
            c = vars3[rand() % numbers].matrix3[rand() % 16];
            c = vars4[rand() % numbers].matrix3[rand() % 16];
            c = vars5[rand() % numbers].matrix3[rand() % 16];
            break;

        case 4:
            c = vars[rand() % numbers].matrix4[rand() % 16];
            c = vars2[rand() % numbers].matrix4[rand() % 16];
            c = vars3[rand() % numbers].matrix4[rand() % 16];
            c = vars4[rand() % numbers].matrix4[rand() % 16];
            c = vars5[rand() % numbers].matrix4[rand() % 16];
            break;

        case 5:
            c = vars[rand() % numbers].matrix5[rand() % 16];
            c = vars2[rand() % numbers].matrix5[rand() % 16];
            c = vars3[rand() % numbers].matrix5[rand() % 16];
            c = vars4[rand() % numbers].matrix5[rand() % 16];
            c = vars5[rand() % numbers].matrix5[rand() % 16];
            break;

        case 6:
            c = vars[rand() % numbers].matrix6[rand() % 16];
            c = vars2[rand() % numbers].matrix6[rand() % 16];
            c = vars3[rand() % numbers].matrix6[rand() % 16];
            c = vars4[rand() % numbers].matrix6[rand() % 16];
            c = vars5[rand() % numbers].matrix6[rand() % 16];
            break;

        case 7:
            c = vars[rand() % numbers].matrix7[rand() % 16];
            c = vars2[rand() % numbers].matrix7[rand() % 16];
            c = vars3[rand() % numbers].matrix7[rand() % 16];
            c = vars4[rand() % numbers].matrix7[rand() % 16];
            c = vars5[rand() % numbers].matrix7[rand() % 16];
            break;

        case 8:
            c = vars[rand() % numbers].matrix8[rand() % 16];
            c = vars2[rand() % numbers].matrix8[rand() % 16];
            c = vars3[rand() % numbers].matrix8[rand() % 16];
            c = vars4[rand() % numbers].matrix8[rand() % 16];
            c = vars5[rand() % numbers].matrix8[rand() % 16];
            break;

        case 9:
            c = vars[rand() % numbers].uv[rand() % 2];
            c = vars2[rand() % numbers].uv[rand() % 2];
            c = vars3[rand() % numbers].uv[rand() % 2];
            c = vars4[rand() % numbers].uv[rand() % 2];
            c = vars5[rand() % numbers].uv[rand() % 2];
            break;
        }
    }

    delete[] vars;
    delete[] vars2;
    delete[] vars3;
    delete[] vars4;
    delete[] vars5;

    system("PAUSE");
    return 0;
}
 

 

I ran an analysis, and the cache numbers seem weird, or I don't understand them.

I've posted an image where L1 and L2 are 0 and L3 is 0.017, even though the program seems to process enough data, randomly enough.

Where should I start looking, especially if I need to optimize another, more complex program?

Thanks

[attachment=30585:vtune.jpg]

 

 

 

Ugh.

1. Are you compiling this in release mode? Profiling debug builds is pointless.
2. Visual C++ will remove all of that code, because none of it has any effect. VC++ isn't that stupid.
3. Don't profile for loops.
4. Don't profile for loops.
5. Don't profile for loops.


Why not profile for loops? If I read the first element of the data, shouldn't the cache line be loaded with the data next to it, so that the next element is a cache hit? And why would Visual Studio change my code? Aren't the random numbers generated at runtime?

You aren't doing anything with 'c'. So VC++ will strip all the code that was used to generate that value. That leaves you with an empty switch, which it will strip. That leaves you with an empty for loop, which it will also strip. Since there's now no point in your object arrays, it will also strip those. Your program now consists of a single system call. :/

As I said, don't profile for loops, because even a poor compiler will strip them out. Profile applications instead.


I'm sorry, but the data you've shown is exactly what's supposed to happen.

 

You're randomly accessing about 253 MB of data per array (500,000 objects of 532 bytes each), roughly 1.2 GB across the five arrays, which obviously does not fit in the cache, and VTune is telling you that you're DRAM bound. This is exactly what happens when one iteration indexes float[5] and float[26600000], and the next iteration indexes float[99990] and float[7898]. The cache is effectively useless, and all the bottlenecks will be in the DRAM.

 

What do you expect it to tell you?

Edited by Matias Goldberg


Cache misses have two main causes:

A) Memory allocation: data that isn't aligned with the CPU's cache lines.

The first thing to know is how big your cache line is. It's probably 64 bytes.

How many cache lines can your cache hold? That depends on the CPU.

That's why big memory allocations tend to cause more cache misses: the more data you touch, the less of it fits in the cache at once.

However, that is not the main reason. The main reason for cache misses is B:

B) Memory access:

How do you access your memory? Is the access pattern cache friendly or not?

For example, think about accessing an array of ints (or even better, a matrix!).

If we load a 16x16 matrix of 32-bit ints, that's great, because the whole matrix (1 KB, or 16 cache lines of 64 bytes) fits in the L1 cache, so after the first touch there are no misses.

However, with a bigger matrix, the order of memory accesses becomes more important.

If the data is stored as X1,Y1, X2,Y1, X3,Y1... and you step through it as X+1,Y, that's good, because the next element is usually already in the cache line you just loaded.

However, if the matrix is stored as X1,Y1, X1,Y2, X1,Y3, X1,Y4... the same X+1,Y step jumps to a different cache line each time, so you'll get cache misses.

 

I hope this little introduction helped. There are tons of guides on Google; please check them.

If you want to investigate cache misses and misuse of memory, that's the place to start looking :)
