Icebone1000

Stack and Cache question (optimizing SW in C++)


Hi, I was reading Optimizing Software in C++ and got to the part about the stack, and I don't understand this: "The stack is the most efficient place to store data because the same range of memory addresses is reused again and again." I understand that the same range of addresses is reused again and again, but I don't see why that is efficient, since the data is rewritten on every new function call.

"If there are no big arrays, then it is almost certain that this part of the memory is mirrored in the level-1 data cache, where it is accessed quite fast." Can someone explain that to me? Why the hell would it be mirrored? Just because the same range of addresses is used again and again? Who cares about the addresses? If the data is different, it will have to be updated every time...

I'm confused because I have the same feeling, but about globals. Since globals are globals because they are more likely to be used over and over again in different parts of the code, to me THAT would be a good caching candidate...

What am I missing?

Quote:
Original post by Icebone1000

"The stack is the most efficient place to store data because the same range of memory
addresses is reused again and again."

Due to the implied order of operations, the most frequently accessed data on the stack is almost certainly always in the L1 cache.

The cost of stack allocation is effectively zero, since it is unchecked and, most of the time, evaluated at compile time and hard-coded into the generated instructions.
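
For instance (a minimal C++ sketch; the exact instructions vary by compiler and ABI), "allocating" locals amounts to nothing more than reserving a fixed-size region of the current stack frame:

// Illustrative only: no allocator is called for any of these locals.
int smooth(int left, int mid, int right) {
    int sum = left + mid + right;  // lives at a fixed offset in the frame
    int avg = sum / 3;             // another offset, known at compile time
    return avg;                    // the frame is "freed" by restoring the stack pointer
}

A typical compiler reserves space for the whole frame with a single stack-pointer adjustment on entry and undoes it on return.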

Quote:
Can someone explain that to me? Why the hell would it be mirrored? Just because the same range of addresses is used again and again?

The L1 cache has an access latency of a few cycles. Accessing memory in L2 or L3 is several times slower, and going all the way to DRAM can be hundreds of times slower.

So you read one byte; the CPU needs to wait, say, 500 cycles for the memory bus to provide the data. Then you read the next byte, and the CPU waits another 500 cycles.

Instead, when you read the first byte, the CPU loads the next 16 (or 64, or however many) bytes into the L1 cache in one read. So on the first access you pay the 500-cycle penalty, but the next 100 accesses cost only a couple of cycles each. The design of the stack fits this well: in typical usage the same small range of addresses is touched over and over, so it stays resident in the cache.

The L1 cache is very fast; it is physically located inside the CPU (to avoid speed-of-light delays), but it runs hot and is very expensive to produce, so it is small (on the order of 64 KB). The better this cache is utilized, the faster memory accesses are.

DRAM is huge and cheap, and it doesn't overheat or draw much power, but it is located far away from the CPU and the bus accessing it runs at only a fraction of the CPU's frequency, both of which mean high latency when accessing such data.

Both of these favor bulk data accesses, so locality of reference is preferred.
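
A rough C++ sketch of the same idea (the array size and access costs here are assumptions; only the access pattern matters):

#include <cstddef>

constexpr std::size_t N = 1024;
static int grid[N][N];

// Row-major walk: consecutive addresses, so each cache line is loaded
// once and fully used before the next one is fetched.
long long sumByRows() {
    long long total = 0;
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            total += grid[r][c];
    return total;
}

// Column-major walk: each access jumps N * sizeof(int) bytes, so nearly
// every access touches a different cache line and wastes the rest of it.
long long sumByColumns() {
    long long total = 0;
    for (std::size_t c = 0; c < N; ++c)
        for (std::size_t r = 0; r < N; ++r)
            total += grid[r][c];
    return total;
}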

Quote:
Who cares about the addresses? If the data is different, it will have to be updated every time...

Writes are indeed more complicated. On a single-core CPU, writes can still be done directly into the cache and written back to main memory when the line is eventually evicted.

Each thread has its own stack, and threads are the most common way to distribute code across cores, so there is no write conflict. The stack is, in this respect, a shared-nothing design.

Multiple cores accessing shared memory do run into this problem; the code must either be reorganized to schedule writes so that the write latency is masked by pipelining, or it must accept the stall.
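
A hedged C++ sketch of that shared-nothing idea (the names are made up for illustration): each thread accumulates into a local on its own stack and performs only one shared write at the end.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

std::atomic<long long> grandTotal{0};

void sumPart(const int* data, std::size_t count) {
    long long local = 0;              // lives on this thread's own stack
    for (std::size_t i = 0; i < count; ++i)
        local += data[i];             // no other core ever touches 'local'
    grandTotal += local;              // a single contended write at the end
}

int main() {
    std::vector<int> data(1 << 20, 1);
    std::size_t half = data.size() / 2;
    std::thread t1(sumPart, data.data(), half);
    std::thread t2(sumPart, data.data() + half, data.size() - half);
    t1.join();
    t2.join();
}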

Quote:
What am I missing?

The fundamentals of hierarchical storage architecture and modern CPU design.

Cache, pipelining.

It's still not very clear to me; you explained more about what the L1 cache is, but I want to know why stack data is most likely to be cached.

Basically, what I'm understanding now is:

Since data on the stack is loaded into the cache in bulk, and for the cache this means "do more with fewer cycles", this is a perfect combination... while globals are more likely to appear randomly spread over the code.

Is this correct?

That explains why it is good for stack data to be cached.

But why is it most likely to be cached?
"Due to the implied order of operations, the most frequently accessed data on the stack is almost certainly always in the L1 cache."

"Due to the implied order of operations" doesn't explain much.

I'm thinking of functions that are called over and over; every time, the stack is updated with new values, so let's say the CPU already has a working set of local, hard-coded variables in the cache... is that the behavior?

Also, in that approach it is most likely that globals are also cached, right? (But just not used as efficiently.)


Thanks for the explanation.

Quote:
Original post by Icebone1000
It's still not very clear to me; you explained more about what the L1 cache is, but I want to know why stack data is most likely to be cached.


Each memory access goes through the cache. The cache is small (~64 KB or 512 KB), so when it's full, the least recently used cache line must be evicted and replaced with new data. In addition, the cache is organized in lines, so a 64 KB cache with 64-byte cache lines can only hold about 1000 different mappings.

For this reason it makes sense for all memory accesses to be as local as possible. Rather than reading from addresses 17, 543 and 1024 (which loads 16 bytes at 17, 16 bytes at 543 and 16 bytes at 1024, consuming three cache slots), it's better to read from 17, 18 and 19 (one 16-byte read starting at 17), which are all in the same cache line and consume one slot.

Stack memory access, by its nature, has the best cache locality. The stack grows and shrinks contiguously (upward or downward), but rarely needs to make large jumps.

Typical function call (trivialized):
// stack pointer is 1200
push 10 // 1200, offset 0, obligatory cache miss; 1200 is not yet in cache, so the line 1200..1216 is loaded into cache
push 20 // 1204, offset 4, in cache
push 30 // 1208, offset 8, in cache
// stack pointer is now 1212
call foo

foo:
mov c, sp-12 // load 10 from offset 0 (stack pointer-12) into c, from cache
pop a // pop 30 from offset 8, from cache
pop b // pop 20 from offset 4, from cache
// do something with a and b
ret -4 // return and discard last variable which was read directly above

This makes passing parameters and all call-stack operations very cache friendly and as cheap as they can be if registers aren't available.

Note that the stack pointer is handled implicitly. If an array were used instead, the code would look like this:
arr[offset] = 10;  // caller "pushes" the arguments by hand (byte offsets, mirroring the assembly above)
offset += 4;
arr[offset] = 20;
offset += 4;
arr[offset] = 30;
offset += 4;

...
c = arr[offset - 12];  // callee reads the first value back
offset -= 4;
a = arr[offset];       // then the value pushed last
offset -= 4;
b = arr[offset];
This is the work the compiler does for you via the push/pop operations, and it is considerably more efficient than manually managed stacks, vectors or arrays. The call stack is like a std::vector or std::stack, but one that the CPU can manage on its own.
Quote:
While globals are more likely to appear randomly spread over the code.

To the CPU there are no globals; it only sees an address (e.g. 0x0324122) *each time* a memory access is made. Caching is done based on addresses, not content or context.

If the memory at a global's address, or the first few bytes from that address, is accessed frequently enough, it will remain in the cache.
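
For example (an illustrative sketch; g_hits and countHits are made-up names), a global that is touched on every iteration of a hot loop stays just as cache-hot as a stack local would:

#include <cstddef>

int g_hits = 0;  // a global is just another address to the CPU

void countHits(const int* samples, std::size_t n, int threshold) {
    for (std::size_t i = 0; i < n; ++i)
        if (samples[i] > threshold)
            ++g_hits;  // same address every iteration, so its line stays in L1
}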

Quote:
with new values, so let's say the CPU already has a working set of local, hard-coded variables in the cache... is that the behavior?

Hard-coded applies to the memory addresses of stack values. Unlike typical data structures, which need to chase pointers and compute offsets, members of the stack frame can be addressed directly relative to the stack pointer. It always points to the top of the stack, so something like [sp-16] inside a function means "access a specific member of the stack frame", and the compiler has generated that to mean, say, c in the function's parameter list.
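
A small C++ illustration of the difference (the types and names are invented for the example):

struct Node { int value; Node* next; };

// Pointer chasing: each '->' is a dependent load whose address is only
// known once the previous load has completed.
int thirdValue(const Node* head) {
    return head->next->next->value;
}

// Stack-frame access: a, b and c sit at offsets fixed at compile time,
// so each access is just "stack pointer plus a constant".
int sumLocals() {
    int a = 1, b = 2, c = 3;
    return a + b + c;
}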
