What do the cache size dimensions and the bus speed numbers tell me about my computer?


So I did some research on how to use a program called CPU-Z (from CPUID) to pull up this information about my computer.

From what I researched, bus speed dictates how fast information travels in your computer. It sounds very similar to a highway. What is considered "fast", and is that universal among computers? Is it really as straightforward as saying the higher the bus speed, the faster information travels? What normally makes it slow or fast, or keeps it stable? Mine seems to fluctuate.

I am also confused about the dimensions of the cache. From what I researched, a cache stores data so that when that data is needed again it can be retrieved quickly. So is a cache similar to an array?

Cache Sizes: L1 Data 2 x 32 kBytes 8 way
L1 Inst 2 x 32 kBytes 8 way
Level 2 4096 kBytes 16 way

Bus Speed: 199.511 MHz

I saw the diagram at http://en.wikipedia.org/wiki/Front-side_bus and realized that they are connected. But I do not understand the meaning behind the "2 x 32" and the "8 way" or "16 way". And what is an L1 Inst cache?


This is a fairly involved discussion. Here is the 10,000 foot view.

Let's assume we have a CPU connected to a memory. The CPU can process data 100x faster than data can be read from or written to this memory. If the CPU needs the value of some memory location loaded, then it spends roughly 100 cycles waiting for the memory to respond. What can be done to speed up performance?

This is where the memory hierarchy enters. A smaller memory can be interposed between the CPU and the larger memory. While its capacity is much smaller than the external memory it is much quicker. What good does this do?

It wouldn't do an ounce of good unless you could rely on the cache to serve the same instructions and data over and over again. If you think about your code, this happens all the time: loops use the same instructions and data repeatedly, function calls reuse the same code in memory, and routines typically work on the same data throughout. So this smaller but quicker memory is an effective solution because code and data tend to be reused. This principle is known as instruction and data locality. Without it, caches would be useless.
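To make locality a little more concrete, here is a tiny hypothetical C++ example (nothing about it is specific to any particular CPU); the comments call out where instruction and data reuse come from:

```cpp
// Tiny illustration of locality. The handful of instructions in the loop
// body are fetched over and over (instruction locality), and the sum plus
// the most recently touched parts of the array are reused or adjacent in
// memory (data locality).
#include <cstdio>

int main() {
    int data[256];
    for (int i = 0; i < 256; ++i)
        data[i] = i;

    int sum = 0;
    for (int i = 0; i < 256; ++i)   // same few instructions every iteration
        sum += data[i];             // sequential reads stay within a few cache lines

    std::printf("sum = %d\n", sum);
}
```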

So let's give a high level example. Assume 1 cycle = 1 CPU operation performed, 10 cycles = a 'hit' on your cache (a hit means your local data or instruction is present) and it takes 100 cycles to access data from main memory not contained in the cache.

Let's compare two machines, one with CPU + main memory and another with CPU + cache + main memory.

On the cacheless computer, sequential accesses to instructions are always at a cost of 100 cycles per instruction (this is a simplification btw -- typically CPUs fetch a batch of instructions at one time). Thus if we have 10 sequential memory accesses we spend 1000 cycles.

On the computer with a cache, let's assume the same 10 instruction accesses as in the prior example. Let's further assume that the first 3 instructions are an iteration through a loop. The first pass through those three instructions costs 3x100 cycles. Assuming they remain in our cache, the remaining 7 accesses hit at a cost of 7x10 cycles, so the system with the cache spends 3x100 + 7x10 = 370 cycles.

Comparing both cases, to perform the same work it takes 1000 cycles on the cacheless machine vs. 370 cycles on the machine with a cache, nearly a 3x speed-up.
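To spell the arithmetic out, here is a minimal sketch of that comparison; the cycle costs are just the illustrative numbers from this example, not real hardware figures:

```cpp
// Toy cycle-count comparison for the example above.
#include <cstdio>

int main() {
    const int accesses = 10;   // total instruction fetches
    const int missCost = 100;  // cycles for a main-memory access
    const int hitCost  = 10;   // cycles for a cache hit
    const int misses   = 3;    // first pass through the 3-instruction loop

    const int noCache   = accesses * missCost;              // 10 * 100 = 1000
    const int withCache = misses * missCost
                        + (accesses - misses) * hitCost;    // 3*100 + 7*10 = 370

    std::printf("no cache:   %d cycles\n", noCache);
    std::printf("with cache: %d cycles\n", withCache);
}
```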

There is much more to this discussion, but that's the general idea of how caching works. Modern computer systems have a hierarchy of caches, usually designated L0, L1, L2, L3, etc. Each increasing number indicates the memory's "distance" from the CPU, i.e. how many levels of cache sit between it and the CPU. An L3, for example, has three caches between it and the CPU (L0, L1, L2).

Typically you'll find the L0 and L1 buried inside the CPU itself if you look at the implementation architecture. They are also typically tightly coupled, meaning you need both in operation for the CPU to run, whereas higher-level caches can often be enabled or disabled depending on your needs. Mobile processors usually allow this in order to conserve power.

Anyways, there's more to your question and I'll try to answer later when I have more time. An awesome book to cut your teeth on: Computer Architecture, A Quantitative Approach by Hennessy & Patterson. It is considered a classic.

Cache is high speed memory.

You want all your data to be as close to the processor as possible. Here are some approximate numbers:

1 processor cycle = 0.3 nanoseconds.
L1 cache hit = 4 cycles (1.2 ns)
L2 cache hit = 10 cycles (3 ns)
L3 cache hit, line unshared = 40 cycles (12 ns)
L3 cache hit, line shared in another core = 65 cycles (20 ns)
L3 cache hit, line modified in another core = 75 cycles (22.5 ns)
Remote L3 cache hit = 100-300 cycles (30-90 ns)
Local DRAM = 60 ns
Remote DRAM = 100 ns
Swapped to disk = 10,000,000 ns

If your code needs some data and it is already in the processor there is no wait.

If your code needs some data and it is kept in L1 cache there is a very small wait.

If your code needs some data and it is kept in L2 cache there is a longer wait.

If your code needs some data and it was already modified in another processor your program can briefly stall.

If your code needs some data and it is all the way out in main memory it can take a relatively long time for your program to continue.

If your code needs some data and that data has been swapped out to virtual memory on disk, it can stall your program for what seems like an eternity.
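If you want to see these effects show up in practice, here is a rough benchmark sketch. The array size and stride are arbitrary illustration values, and absolute timings will differ from machine to machine, but the sequential walk should come out noticeably faster than the strided one:

```cpp
// Rough sketch of how locality shows up in practice: walking the same array
// sequentially (cache-friendly) versus in a large-stride pattern that jumps
// across cache lines and pages (cache-hostile). Both walks touch every
// element exactly once, so only the access order differs.
#include <chrono>
#include <cstdio>
#include <vector>

static long long walk(const std::vector<int>& data, std::size_t stride) {
    long long sum = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < data.size(); i += stride)
            sum += data[i];
    return sum;
}

int main() {
    // 64 MB of ints, larger than any L3 cache on a typical desktop.
    std::vector<int> data(16 * 1024 * 1024, 1);

    for (std::size_t stride : {1u, 4096u}) {   // 4096 ints = 16 KB jumps
        const auto t0 = std::chrono::steady_clock::now();
        const long long sum = walk(data, stride);
        const auto t1 = std::chrono::steady_clock::now();
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("stride %4zu: %lld ms (sum=%lld)\n", stride, (long long)ms, sum);
    }
}
```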

When they talk about cache associativity they are referring to how the data is stored within the cache. As a gross simplification, when you have a large number of cache bins you need to keep track of which bin contains which data. When you have 128KB of memory kept in 64-byte bins that is a lot of bins. On the one hand you want all your data in cache memory. On the other hand you don't want to search every bin in order to find which bin actually holds your data. If you have a 16-way set associative cache it means you need to sort through 1/16th of the bins to find the data.

Swap space = 10,000,000 ns


Fixed that for ya. Virtual memory has a very exact meaning; using the disk is an implementation feature of particular operating systems, not the sole component of virtual memory.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Thanks for that, updated the post.

This is a fairly involved discussion. Here is the 10,000 foot view.

*snip*

Anyways, there's more to your question and I'll try to answer later when I have more time. An awesome book to cut your teeth on: Computer Architecture, A Quantitative Approach by Hennessy & Patterson. It is considered a classic.

Thanks for the book recommendation.

Cache is high speed memory.

You want all your data to be as close to the processor as possible. Here are some approximate numbers:

*snip*

Interesting, I am understanding it a bit more as I continue reading this portion. Thanks frob!

1 processor cycle = 0.3 nanoseconds.
L1 cache hit = 4 cycles (1.2 ns)
L2 cache hit = 10 cycles (3 ns)
L3 cache hit, line unshared = 40 cycles (12 ns)
L3 cache hit, line shared in another core = 65 cycles (20 ns)
L3 cache hit, line modified in another core = 75 cycles (22.5 ns)
Remote L3 cache hit = 100-300 cycles (30-90 ns)
Local DRAM = 60 ns
Remote DRAM = 100 ns
Swapped to disk = 10,000,000 ns

Just to add some data points to this table, and drive the point home:
An integer instruction has a throughput of 1 to 4 cycles. So an increment may take 1.0 ns (3 cycles).

A SIMD instruction will be the same, except being SIMD, the throughput can be as much as 8 times more per operand. So with a vectorizing compiler, and ideal conditions, an increment would take about 130 ps.

So as a consequence a single instruction like ++i may take anywhere from 130 ps to 10 ms.

EDIT: Divide that by the number of real cores, if you want to count MIMD. That means a 6-core desktop processor like the i7-3930K can do 20 ps throughput for increment instructions, best case.
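For what it's worth, a loop like the one below is the kind of code a vectorizing compiler can turn into SIMD increments; whether it actually does depends on the compiler, optimization flags (e.g. -O3), and target, so treat the 8x figure as a best case rather than a guarantee:

```cpp
// Simple, independent per-element increments: a good candidate for
// auto-vectorization, where one SIMD instruction updates several ints at once.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v(1 << 20, 0);   // 1M elements

    for (std::size_t i = 0; i < v.size(); ++i)
        ++v[i];                       // candidate for a SIMD increment

    std::printf("%d\n", v[0]);        // keep the result observable
}
```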

Just to add some data points to this table, and drive the point home:
An integer instruction has a throughput of 1 to 4 cycles. So an increment may take 1.0 ns (3 cycles).

*snip*

It is hard to give a truly universal cost for instructions, except perhaps for embedded processing systems.

The CPU pipeline amortizes the cost of a sequential flow of instructions, assuming they have no hazards. Even though a single instruction may take several pipeline stages to execute, the next sequential instruction executes at the same time and comes out one cycle later.

For example, if you have an 11 stage pipeline and you issue one instruction it will take 11 clock cycles. However, if you have 11 instructions that have no hazards it will take only 21 CPU cycles to execute, rather than 11x11=121 cycles.

A further thing to muddy the waters: modern CPUs have multiple dispatch, even for SISD. Multiple ALUs in several different pipelines can shorten that 21-cycle cost even further.
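To spell out the arithmetic in that 11-stage example: with a pipeline of depth D and N back-to-back hazard-free instructions, the idealized cost is D + (N - 1) cycles, versus D * N if every instruction had to drain the pipeline on its own. A trivial sketch:

```cpp
// Idealized pipeline timing for N hazard-free instructions.
#include <cstdio>

int main() {
    const int depth = 11;   // pipeline stages, as in the example above
    const int n     = 11;   // back-to-back instructions with no hazards

    std::printf("pipelined:     %d cycles\n", depth + (n - 1));   // 21
    std::printf("non-pipelined: %d cycles\n", depth * n);         // 121
}
```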

When they talk about cache associativity they are referring to how the data is stored within the cache. As a gross simplification, when you have a large number of cache bins you need to keep track of which bin contains which data. When you have 128KB of memory kept in 64-byte bins that is a lot of bins. On the one hand you want all your data in cache memory. On the other hand you don't want to search every bin in order to find which bin actually holds your data. If you have a 16-way set associative cache it means you need to sort through 1/16th of the bins to find the data.

I can understand the desire to simplify things here, but I think you've oversimplified to the point where you've inverted what associativity means. A 16-way associative cache doesn't sort through 1/16th of the bins, it searches 16 bins. An 8-way associative cache would search through 8 bins.

To try again (still simplified): the idea behind a cache is that it stores the most recently used memory, on the assumption that if you've used it recently you're likely to want it again in the near future. One way to handle this is that when you access some memory, if it isn't already in the cache you find the bin that has gone the longest without being accessed and replace its contents with the new memory. Because any bin can be used for any memory address, this is called a fully-associative cache: every memory address can be associated with any bin. The downside is that it takes effort to figure out which bin has gone the longest without access and which bin currently holds which memory address, which slows the cache down.

The opposite approach is called direct-mapped (one-way associative): every memory address can only ever be found in one particular bin. Now you no longer need to keep track of age or hunt down the bin for a given memory address. On the downside, if you alternate between two memory addresses that map to the same bin, the cache might as well not exist because every access generates a cache miss.

Between these two approaches are the n-way associative caches. Let's say you have a two-way associative cache. Here every memory address has two associated bins. If you want to see which bin a given memory address is in, you only need to check those two bins. It's also a lot easier to keep track of which of those two bins has gone longest without being accessed. And since two memory addresses that map to the same set of bins can now be in the cache at the same time, it's harder to hit usage patterns that generate a cache miss on every access. Replace two with some value n to get your n-way associative caches, like the 8-way and the 16-way. As n gets bigger, the circuitry gets more complex; as n gets smaller, it becomes easier to run into memory access patterns the cache can't handle efficiently.
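To make the n-way idea concrete, here is a minimal sketch of a set-associative lookup. The line size, way count, and set count are made-up illustration values (they happen to work out to a 32 KB, 8-way cache like the L1 figures in the original post), and the replacement on a miss is deliberately naive; a real cache would track which way was least recently used:

```cpp
// Minimal sketch of an n-way set-associative lookup: an address maps to
// exactly one set, and only that set's few "bins" (ways) are searched.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kLineSize = 64;   // bytes per cache line
constexpr std::size_t kWays     = 8;    // 8-way set associative
constexpr std::size_t kSets     = 64;   // 64 sets * 8 ways * 64 B = 32 KB total

struct Line { bool valid = false; std::uint64_t tag = 0; };
using Cache = std::array<std::array<Line, kWays>, kSets>;

bool lookup(Cache& cache, std::uint64_t address) {
    const std::uint64_t lineAddr = address / kLineSize;  // drop the byte-within-line offset
    const std::uint64_t set      = lineAddr % kSets;     // which set this address maps to
    const std::uint64_t tag      = lineAddr / kSets;     // identifies the line within that set

    for (Line& way : cache[set])                         // search only kWays bins, not the whole cache
        if (way.valid && way.tag == tag)
            return true;                                 // hit

    cache[set][0] = {true, tag};                         // miss: naively fill way 0
    return false;
}

int main() {
    Cache cache{};
    std::printf("first access:  %s\n", lookup(cache, 0x1000) ? "hit" : "miss");
    std::printf("second access: %s\n", lookup(cache, 0x1000) ? "hit" : "miss");
}
```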

