Cache Misses in OpenGL

8 comments, last by LorenzoGatti 4 years, 8 months ago

So cache misses are a thing. But is there any recommended cache-miss count per unit of time?
I rendered one textured square in OpenGL and ran it for 10 seconds. The cache-miss count was ~137,230,200 (about 137 million), measured with perf stat -e cache-misses. This seems bad.
What I don't get is that I am not loading anything in the rendering loop, like meshes or entities. It's just plain OpenGL calls. So should I take this as the baseline from which I start optimizing as soon as I add entities and physics?


How did you detect and count your cache misses?

7 hours ago, Randy Gaul said:

How did you detect and count your cache misses?

perf stat -e cache-misses ./a.out

Does that count cache misses on the video card by the GPU?

perf is a Linux command-line tool that hooks into the kernel, so I doubt it has its fingers in the graphics driver, but I lack deeper knowledge.

7 hours ago, Green_Baron said:

perf is a Linux command-line tool that hooks into the kernel, so I doubt it has its fingers in the graphics driver, but I lack deeper knowledge.

Since graphics drivers on Linux are kernel modules, I would expect such information to be exposed (except perhaps for proprietary kernel modules like NVIDIA's).

Here is a doc I quickly found for Intel hardware.

21 hours ago, ritzmax72 said:

So cache misses are a thing. But is there any recommended cache-miss count per unit of time?
I rendered one textured square in OpenGL and ran it for 10 seconds. The cache-miss count was ~137,230,200

Why record for such a long time? I think you should measure a single frame.

Cache misses will happen, and I believe even more so on the GPU side, where you have very little control over them: one miss can happen in each work group individually and might even be reported once per wave.
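A rough first step toward the single-frame view is to normalize the 10-second total to an average per frame. This is a minimal sketch with a hypothetical helper name, and it assumes a constant 60 fps (for example, vsync-locked), which the thread does not state. The result is an average, so it hides frames that are much worse than the rest.

```python
def misses_per_frame(total_misses, seconds, fps=60.0):
    """Average cache misses per frame, assuming a constant frame rate (hypothetical helper)."""
    frames = seconds * fps  # number of frames rendered during the run
    if frames <= 0:
        raise ValueError("the run must contain at least one frame")
    return total_misses / frames

# The total reported in the first post, over a 10-second run at an assumed 60 fps.
print(misses_per_frame(137_230_200, 10))  # 228717.0
```

Counting one specific frame directly would mean reading the hardware counter around that frame (on Linux, through the perf_event_open interface), which this sketch does not attempt.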

11 hours ago, Alberth said:

Does that count cache misses on the video card by the GPU?

No, I meant on the CPU. All physics and collision code will run inside the main rendering loop (no threading planned). I already get over a million cache misses per 10 seconds, so I am scared that if I add physics it would be too much.

Your CPU runs in the gigahertz range, i.e. about 1,000,000,000 cycles per second, so 1 million is practically nothing, and 10 seconds is eons in CPU time (1 million misses in 10 billion cycles is about 0.01 percent). It also makes a big difference whether these are L1 or L3 cache misses.

Also, a cache miss means the data isn't in the cache when it is needed; it does not mean the CPU is doing nothing at that time. The CPU reorders instruction streams, predicts branches, and so on. Finally, since your main program is minimal, you don't give the compiler and/or CPU many options to schedule other work while it waits for the missing data.
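The percentage above comes from dividing misses by elapsed cycles. A minimal sketch of that arithmetic, assuming a single core at the round 1 GHz figure from the post; the helper name is hypothetical, and the second line simply applies the same yardstick to the raw total from the first post.

```python
def misses_per_cycle(misses, seconds, clock_hz):
    """Cache misses per elapsed CPU cycle on one core (a rough yardstick only)."""
    cycles = seconds * clock_hz  # e.g. 10 s at 1 GHz = 10,000,000,000 cycles
    return misses / cycles

print(f"{misses_per_cycle(1_000_000, 10, 1e9):.4%}")    # 0.0100%
print(f"{misses_per_cycle(137_230_200, 10, 1e9):.4%}")  # 1.3723%
```

This counts misses, not stall time: a miss can cost anywhere from a handful of cycles (when the next cache level has the data) to hundreds (when it has to go to DRAM), and out-of-order execution can overlap part of that latency with other work. That is why the level of the miss matters more than the raw count.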

2 hours ago, ritzmax72 said:

I am scared that if I add physics it would be too much.

Don't think like that. It leads to premature optimization, which is mostly a complete waste of time, as it is impossible to predict whether you will have a problem, and if so, where.

Just build the program in a sane way (i.e. don't use bubble sort to sort a million entries), and when you hit a performance problem, profile where the problem is (since with 99% certainty it is not where you think it is), fix it, and profile again to check.
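Knowing when you have actually hit a performance problem can be as simple as timing each frame against a budget. A minimal sketch, assuming a 60 fps target (about 16.7 ms per frame; the thread does not name one) and using a hypothetical fake_frame as a stand-in for the real loop body. It only tells you whether you are too slow; a real profiler (perf record, for example) is still what tells you where.

```python
import time
from statistics import mean

FRAME_BUDGET_S = 1 / 60  # assumed target: 60 fps, about 16.7 ms per frame

def profile_frames(frame, n_frames):
    """Call `frame` n_frames times and return each call's wall-clock time in seconds."""
    durations = []
    for _ in range(n_frames):
        start = time.perf_counter()
        frame()
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations, budget=FRAME_BUDGET_S):
    """Mean and worst frame time in milliseconds, plus how many frames exceeded the budget."""
    return {
        "mean_ms": mean(durations) * 1000,
        "worst_ms": max(durations) * 1000,
        "over_budget": sum(1 for d in durations if d > budget),
    }

# Hypothetical stand-in for one iteration of the render/physics loop.
def fake_frame():
    sum(i * i for i in range(10_000))

print(summarize(profile_frames(fake_frame, 120)))
```

Re-running the same measurement after each change is the "profile again to check" step; while over_budget stays at zero, there is no problem worth chasing yet.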

On 8/17/2019 at 8:57 AM, Alberth said:

Also, a cache miss means the data isn't in the cache when it is needed; it does not mean the CPU is doing nothing at that time. The CPU reorders instruction streams, predicts branches, and so on.

Moreover, there is a minimum possible number of cache misses, because you need to access your data at least once. You are only being inefficient if the same memory location is loaded into (and evicted from) the cache several times.
Suppose you need to read, write, or update N bytes of data once per frame and your data cache is M bytes: you cannot have fewer than N - M bytes' worth of cache misses per frame, so more cache misses can simply mean more data.
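The N - M bound can be expressed as a number of cache lines. A minimal sketch with a hypothetical helper name, assuming 64-byte lines (common on x86-64 and many ARM cores) and an idealized cache in which all M bytes survive from one frame to the next. Real caches are set-associative and shared with other code, so the actual floor is higher.

```python
def min_line_fills_per_frame(n_bytes, cache_bytes, line_bytes=64):
    """Lower bound on cache lines fetched per frame in steady state.

    At most cache_bytes of the frame's n_bytes can still be cached from the
    previous frame, so at least n_bytes - cache_bytes must come from further
    out in the memory hierarchy, rounded up to whole lines.
    """
    spill = max(0, n_bytes - cache_bytes)
    return -(-spill // line_bytes)  # integer ceiling division

MiB = 1024 * 1024
print(min_line_fills_per_frame(16 * MiB, 8 * MiB))  # 131072
print(min_line_fills_per_frame(4 * MiB, 8 * MiB))   # 0: everything fits
```

This ignores the very first frame, where every byte touched is a compulsory miss.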

You should consider ways to compact your data structures to ensure coherent access, without worrying about the totals too much: for example, make sure you actually use the data you put into the cache (e.g. access arrays sequentially), and avoid wasted space (padding, slack, irrelevant fields in a structure).
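Both points can be illustrated with a simple model. The sketch below uses ctypes, which lays out structures by the platform's C rules, and assumes 64-byte cache lines. ParticleAoS, Padded, Reordered and lines_touched are hypothetical names. The first part counts the distinct lines touched when reading only the x field of 1,000 records, comparing an array of structs with a separate contiguous x array. The second part shows padding caused by field order. This is a counting model, not a cache simulator.

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes

class ParticleAoS(ctypes.Structure):
    # Hypothetical record: a position update reads x, y, z, but the other
    # fields share the same cache lines and get pulled in anyway.
    _fields_ = [
        ("x", ctypes.c_float), ("y", ctypes.c_float), ("z", ctypes.c_float),
        ("mass", ctypes.c_float),
        ("color", ctypes.c_uint32 * 4),
        ("id", ctypes.c_uint64),
    ]

def lines_touched(count, stride, field_size, field_offset=0):
    """Distinct cache lines touched when reading one field from `count` records
    laid out `stride` bytes apart."""
    lines = set()
    for i in range(count):
        first = field_offset + i * stride
        last = first + field_size - 1
        lines.update(range(first // CACHE_LINE, last // CACHE_LINE + 1))
    return len(lines)

N = 1000
aos = lines_touched(N, ctypes.sizeof(ParticleAoS), 4)  # x inside an array of structs
soa = lines_touched(N, 4, 4)                            # x in its own contiguous float array
print(aos, soa)  # 625 63 when sizeof(ParticleAoS) == 40

class Padded(ctypes.Structure):
    # bool, double, bool: padding is inserted to align the double
    _fields_ = [("alive", ctypes.c_bool), ("value", ctypes.c_double), ("flag", ctypes.c_bool)]

class Reordered(ctypes.Structure):
    # Same fields, largest first: less padding
    _fields_ = [("value", ctypes.c_double), ("alive", ctypes.c_bool), ("flag", ctypes.c_bool)]

print(ctypes.sizeof(Padded), ctypes.sizeof(Reordered))  # 24 16 on typical x86-64
```

The same reasoning carries over to C or C++ structs: keeping hot fields together, in arrays walked sequentially, means more of every line fetched is data the loop actually uses.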

