How does the CPU access memory?

8 comments, last by RobTheBloke 12 years, 8 months ago
Apologies if this is the incorrect section of the forum to post this.

This is a hardware question. I'm just wondering how the CPU addresses memory.
I've read many different articles and forum discussions on this topic but I still can't understand it.

I found it easier to ask my question with the help of an image which I've attached to this post.

[Attached image: Misalignment_Q.png]

As far as I'm aware, the CPU can only access a memory location at every 4 bytes (e.g. 0x00, 0x04, 0x08...). Is this true, or have I picked this up wrong?
If this is true, what stops the CPU from accessing an arbitrary memory location? If it could access memory at any location it wanted, it would solve misalignment issues and save some memory from being wasted on padding data.

I know there is obviously a solid reason out there for this. I just hope someone can help me understand it.

Thanks.
Jai.
Hopes up High - Head down Low
Someone will probably correct me on this, but the CPU is designed to access memory a word at a time. It's not that it can't do anything else; it's that it's a faster operation that way. As for the misalignment problem, the CPU is designed to grab one cache line at a time, and depending on where your data sits, it can span more than one cache line. The obvious fix is to grab the first cache line and then the second, but that is the slow alternative.
Another solution is to use hardware that handles misalignment, or to depend on the software (e.g. the compiler) to handle it for you.
Nothing stops the CPU from accessing 'any' memory location; each logical address gets mapped to a physical address via the page table. But your problem doesn't really have to do with this. Again, it's faster for the CPU to read one word from a cache line, where the word could be 4 bytes or whatever. Letting an access start at an arbitrary location won't solve your problem, because your data can still span multiple cache lines, and it would still take the CPU another cycle to grab the second one.
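To make the cache-line point concrete, here is a minimal sketch (not from the post above) that counts how many cache lines a single access touches. The name `lines_touched` and the 64-byte default are assumptions for illustration only; real line sizes vary from chip to chip.

```cpp
#include <cstddef>
#include <cstdint>

// Number of cache lines touched by an access of `size` bytes starting at `addr`.
// The 64-byte default line size is an assumed, commonly seen value.
std::size_t lines_touched(std::uintptr_t addr, std::size_t size,
                          std::size_t line_size = 64) {
    if (size == 0) return 0;
    std::uintptr_t first = addr / line_size;               // line holding the first byte
    std::uintptr_t last  = (addr + size - 1) / line_size;  // line holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

Under these assumptions, a 4-byte read at address 60 stays inside one 64-byte line, while the same read at address 62 touches two lines, which is the slow case described above.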

Anyways, as I said, someone will probably kill every letter of my response, but I say, hey I tried.

regards, D.Chhetri
Edge cases will show your design flaws in your code!
Memory bus.

A naive bus would allow an arbitrary number of bytes to be read, even a single byte.

But since such systems are pushing the speed of light, latency becomes an unsolvable problem. So by reducing the address bus width and increasing the data width, bandwidth is increased at no extra cost and at the same latency.

As mentioned in the link, it makes sense for the data width to be the same as or larger than a single register, to reduce overhead. It may also simplify addressing and reduce instruction size.

Many CPUs today rely on caches to improve memory access performance. Caches are some of the largest and most expensive parts of a CPU, so they need to be kept simple. By putting restrictions on addressing that fit well with how caches store data, one gains a lot for relatively little effort.

[quote]save some memory from being wasted on padding data.[/quote]
This type of memory overhead doesn't matter: the time to access the data via the bus is the same whether accessing 1 byte or 64, as long as they are consecutive. The penalty comes from loading into registers.

Intel's and x86 CPUs in general are very lenient: they allow almost completely arbitrary memory addressing, IIRC even with SIMD, at least on recent chips. Most others are not. It's simply not worth the extra complication.

It's interesting to note that a lot of hacks used in C++ violate alignment rules, even though the standard specifies the result as "undefined". Unions, casts and related conversions can all trigger CPU errors on certain architectures, but rarely if ever on x86.

POSIX, for example, has a special signal, SIGBUS, which is raised in such a case.

[quote]This is a hardware question. I'm just wondering how the CPU addresses memory.
I've read many different articles and forum discussions on this topic but I still can't understand it.

As far as I'm aware, the CPU can only access a memory location at every 4 bytes (e.g. 0x00, 0x04, 0x08...). Is this true, or have I picked this up wrong?
If this is true, what stops the CPU from accessing an arbitrary memory location? If it could access memory at any location it wanted, it would solve misalignment issues and save some memory from being wasted on padding data.[/quote]


There are a lot of different chips and architectures out there.

What is valid on one chip may not be allowed on another.


Most modern systems operate on cache lines, which are blocks of data at least as large as an individual byte. As far as programming is concerned, memory is traversed as single bytes. As far as the hardware is concerned, memory may be traversed as partial bytes (nibbles), or in individual bytes, or 2-byte increments, or in 4-byte increments, or 8-byte, or 16-byte, or 64-byte, or any other size increments.


For example, most people here are used to the x86 family. Today's x86 chips access memory in blocks, pulling it through the various caches (main memory -> L3 -> L2 -> L1) in progressively fine increments. The individual cache lines for the CPU may be 64 bytes long, the next level cache may be 512 bytes (with 8 L1 entries per line), and the next level may be 4K per line. Or whatever it happens to be for that chip. Assuming that was the cache hierarchy for the chipset the smallest block the machine will request is actually 64 bytes, not a single byte. It will progressively invalidate cache lines through the various levels out to main memory.

But look back in time to x86 machines that had much smaller caches, and going back even farther, no cache at all. Those early chips did have a small instruction prefetch queue, and one trick for identifying your processor was to modify an instruction a few bytes ahead of the current instruction, generally changing subtraction to addition. If the original subtraction still ran, the instruction had already been fetched into the queue, so you knew the queue was at least that long; if the addition ran, the queue was shorter. Since the queue length differed between models, that told you which processor you likely had.

Look at the huge number of ARM chips used in everything from handhelds to cell phones. There are hundreds of layouts for the systems. Similarly with Intel: x86 is just one of their many product lines, and they have many others used for embedded devices, like XScale and PHY, to name just two. Then there are other brands like Motorola, MIPS, DEC, Power, VAX, and many, many others, each with their own instruction set families. Their memory systems are hugely variable, ranging from no cache whatsoever to full on-die memory operating at cache speed.



You also talked about alignment.

The x86 family will incur a penalty for misalignment, but the architecture will continue to run. This is the exception rather than the rule across all architectures. For most embedded systems, cell phones, and other processor families, having misaligned values will cause a crash or even a full hard stop of the CPU. It is the programmer's responsibility to follow the architecture's rules.


There is a huge world of computer processors out there. Assuming you've opened your PC before, you have seen that it is made up of hundreds or even thousands of smaller chips, each with its own instructions, memory and programs, working subordinate to the main processor.

When talking of programming generally, you need to remember that it isn't just about the big Core i7. Programming applies just as directly to the individual components on every board and device you attach to the machine, from the high-end graphics processors to the components within your mouse to the little microcontroller that runs your keyboard lights.
Thanks for replying guys.

It has become quite apparent that I'll need to read up a bit on cache lines =P.

My confusion still exists, I'm afraid.

[quote]As far as the hardware is concerned, memory may be traversed as partial bytes (nibbles), or in individual bytes, or 2-byte increments, or in 4-byte increments, or 8-byte, or 16-byte, or 64-byte, or any other size increments.[/quote]

If this is the case, then why would the following (Example 1) cause 2 reads, a shift and an 'OR' to produce the resulting data, yet in Example 2 the CPU can just access address '1' and take the required byte?

Example 1:
<-----4 bytes----->
0---1----2----3----4----5----6----7----8 <-- Address
|----o----o----o----|----o----o----o----|
...........^____________^
...........<-----4 bytes----->
CPU needs these 4 bytes of data. (Sorry about the full stops. Goes crazy out of format without them).


Example 2:

<-----4 bytes----->
0---1----2----3----4----5----6----7----8 <-- Address
|----o----o----o----|----o----o----o----|
.....<--->

CPU needs this 1 byte

In Example 1 as I'm picturing it, the CPU could state that it wants to read 4 bytes starting from Address 2 (as it can traverse memory in individual bytes, 2 bytes etc..).

In Example 1, this would make sense to me if the CPU could only access memory locations starting at word boundaries (i.e. 0, 4, 8, 12 etc) and it was completely unable to access memory starting at any location (i.e 1, 2, 3, 5, 6 etc). But then this would mean that for the CPU to get the required byte in Example 2, it would have to read the entire word (or 2 bytes) starting from address '0'. (I've got a pretty messed up visualisation of this in my head, haven't I?)

(Maybe the reason for this is the interaction with the cache-lines and for me to fully understand this, I'm going to have to read up on them)

I'm not trying to argue with fact - I could just continue reading through this book, which started my confusion, and just do as I'm told - Keep data members aligned!
But I'd much prefer to understand why.

Thanks and sorry for any irritation caused.
Jai.
The reason for 'why' is that, in many cases, the CPU's memory controller will only read in data in multiples of cache-line lengths and as such will only use the addresses as multiples of those cache lines.

So, if your cache line is 4 bytes long, then the CPU will ALWAYS issue a read for 4 bytes and will always align it to a 4-byte address (so 0x0, 0x4, 0x8 and 0xC are the only values the final nibble of the address can take).

Let's say you have code which is running on a system with 4-byte-long cache lines and a bus wide enough to load 4 bytes at a time.

Thus, if your code wanted to access the 3rd byte of memory, the CPU would issue a read for all 4 bytes and then supply you with that third byte via a register, keeping the rest in the cache.
Now, let's say you wanted to access 2 bytes starting at the 4th byte (offset 0x3). This time the CPU will issue 2 read requests (one at offset 0x0, one at offset 0x4), which will fill 2 cache lines, and it will then correctly present you with the two bytes you were interested in via a register (how it does it is somewhat unimportant; shifting and masking is one method).
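The shifting and masking mentioned above can be sketched in software. This is only a toy model, assuming a little-endian memory that can be read solely in aligned 4-byte words; `read_aligned_word`, `read_u32_any` and `read_u8_any` are made-up names, and real hardware does this in circuitry rather than code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of a memory that can only be read one aligned
// 32-bit word at a time (`addr` must be a multiple of 4, little-endian).
std::uint32_t read_aligned_word(const std::uint8_t* mem, std::size_t addr) {
    return  static_cast<std::uint32_t>(mem[addr])
         | (static_cast<std::uint32_t>(mem[addr + 1]) << 8)
         | (static_cast<std::uint32_t>(mem[addr + 2]) << 16)
         | (static_cast<std::uint32_t>(mem[addr + 3]) << 24);
}

// Read 4 bytes starting at any byte address using only aligned reads.
std::uint32_t read_u32_any(const std::uint8_t* mem, std::size_t addr) {
    std::size_t base  = addr & ~static_cast<std::size_t>(3);    // round down to a word boundary
    unsigned    shift = static_cast<unsigned>(addr - base) * 8; // misalignment in bits
    std::uint32_t lo = read_aligned_word(mem, base);
    if (shift == 0) return lo;                                  // aligned: one read is enough
    std::uint32_t hi = read_aligned_word(mem, base + 4);        // second read
    return (lo >> shift) | (hi << (32 - shift));                // shift and OR
}

// Read one byte: a single aligned read, then shift and truncate.
std::uint8_t read_u8_any(const std::uint8_t* mem, std::size_t addr) {
    std::size_t base = addr & ~static_cast<std::size_t>(3);
    return static_cast<std::uint8_t>(read_aligned_word(mem, base) >> ((addr - base) * 8));
}
```

In this model a 4-byte read at address 2 costs two aligned reads plus a shift and an OR, while a 1-byte read at address 1 costs a single aligned read plus a shift, which matches the two examples in the question.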

For a modern processor the same idea holds; the cache lines are just bigger, and so is the bus the data travels across.

Also, recent x86/x64 CPUs will do what's known as aggressive prefetching, where they assume that as you are accessing one cache line of data you'll probably want the next couple as well, and issue memory requests to bring that data into the cache before it is used. So, in our example above of accessing the 3rd byte, the CPU would issue the initial read at offset 0x0 and then might go on to speculatively issue reads for 0x4 and 0x8 as well, so that they are 'warm' in the cache.
Knowing details about the hardware, like how the cache works, can help with optimizations and help guide your programming decisions. It is something that is nice to know and occasionally useful, but not something to be overly concerned about.

You can get great performance boosts through careful use of the cache, but as the cache sizes are ever-increasing and the cache prediction improves and compilers improve, it is generally not worth the investment.
As mentioned originally, from a PROGRAMMING perspective you generally don't need to be concerned about the cache and cache effects.

[quote]I could just continue reading through this book, which started my confusion, and just do as I'm told - Keep data members aligned! But I'd much prefer to understand why.[/quote]

Alignment is very different from cache effects.

The CPU's memory cache works behind your back; you generally don't need to know anything about it. You don't need to follow any particular rules, it just works.



Alignment is important when moving blocks of memory around. The compiler generally handles it for you, but only as long as you follow the rules.

The C++ language only says that it will set up correct alignment when you create objects. You'll need to respect that alignment. Since the language is portable you don't have any guarantees about what the alignment is. It might have no alignment requirement, or require 2-byte alignment, 4-byte alignment, 8-byte alignment, 16-byte alignment, or 32-byte alignment, or something else entirely.
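As a rough illustration of the compiler setting up alignment for you, the sketch below compares two struct layouts. The struct names and the helper `floats_aligned_in_padded` are invented for this example, and the exact sizes are platform-dependent, so nothing here assumes a particular value.

```cpp
#include <cstddef>

// Members ordered so the compiler may have to insert padding before each float.
struct Padded {
    char  a;
    float x;
    char  b;
    float y;
};

// Same members, widest first, which usually needs less padding.
struct Reordered {
    float x;
    float y;
    char  a;
    char  b;
};

// True when every float member of Padded sits at an offset that is a
// multiple of alignof(float), whatever that happens to be on this platform.
constexpr bool floats_aligned_in_padded() {
    return offsetof(Padded, x) % alignof(float) == 0 &&
           offsetof(Padded, y) % alignof(float) == 0;
}
```

On a common platform with 4-byte, 4-byte-aligned floats, `sizeof(Padded)` would typically come out larger than `sizeof(Reordered)`, but the only portable way to know is to ask the compiler with `sizeof`, `alignof` and `offsetof`.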


The "why" is because the hardware manufacturer said so. The instruction set may require that when loading a float into a register it must be properly aligned on a 4-byte boundary. It may require that when loading a 128-bit variable for SSE it is aligned on a 16-byte boundary. This is a concern at the assembly-code level and varies based on the processor's instruction set.
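For the 16-byte case, a small sketch of how you might request and check such an alignment in C++ follows. `is_aligned` and `Vec4` are hypothetical names for illustration; the code only checks addresses and does not issue any SSE instructions itself.

```cpp
#include <cstddef>
#include <cstdint>

// True when `p` is a multiple of `alignment` (which must be a power of two).
bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Four floats forced onto a 16-byte boundary with alignas.
struct alignas(16) Vec4 {
    float v[4];
};
```

A `Vec4` declared as a local or global variable satisfies `is_aligned(&vec, 16)`, whereas a pointer 4 bytes into it does not, and that second pointer is the kind an alignment-strict load would reject.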



A benefit of a higher-level language like C++ is that you don't need to know those requirements. It happens automatically. What's more, you may have written it on one processor (perhaps the x86 family), but you can compile the code to run on a different (perhaps the ARM9 family, or DEC Alpha or an older VAX PDP) processor and the compiler will automatically move things around for that platform. That is the compiler's job, and it does it well.

The drawback is that sometimes you want to move objects around in memory without respecting alignment. In that case you need to be aware that because of your own unsafe actions objects may become mis-aligned.

Let me reiterate that: The compiler will correctly handle alignment for you. You only need to know about it when you intentionally force objects out of alignment. You will need to put them back in alignment before reuse as a type the compiler recognizes.



Let's give an example.

Let's assume you are writing code for a particular smart phone. It's got a good ARM processor, and supports floating point. Let's say you are working on the networking code for the game.

You are writing serialization code to send a packet across the network. You could just dump a raw packet across the network since the clients are identical on both ends. However, you know that doing so would waste a bunch of space due to poor packet layout. Wanting to save space, you pack your data with no padding between elements. It ends up looking like this: { long, long, char, float, float, char }. Your offsets are: 0, 4, 8, 9, 13, 17.

As long as you are working only within memory as raw data, copying those values from one memory location to another, there is no problem. You can use memcpy or other pointer-based manipulation to move the integer and floating point values to and from that buffer without an issue. As long as you treat them as pointers to raw data there is no issue. It only becomes an issue when you assume a particular type.

But one of the programmers doesn't know that. He writes:

float f = *(float*)(ptr + 13); // CRASH on the device: misaligned float load.

The data is not aligned, and it should still be treated as just plain arbitrary data. Instead of copying the data into a properly aligned location, he has just gone ahead and loaded a misaligned value. The programmer has used a cast that tells the compiler "I know what I'm doing, assume the data is actually a float". However, it was not correct because the address did not meet the alignment requirements for a float. The correct approach is to create a float variable (which the compiler will automatically align for you), use a memcpy or other pointer-based manipulation to move the data into the variable, and then use the variable.
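The memcpy approach described above might look like the following sketch. `read_float` and `write_float` are hypothetical helper names, and the 18-byte packet with a float at offset 13 simply follows the packed layout from the example.

```cpp
#include <cstddef>
#include <cstring>

// Copy a float out of a raw byte buffer at any offset, aligned or not.
float read_float(const unsigned char* buffer, std::size_t offset) {
    float value;                                        // the compiler aligns this local
    std::memcpy(&value, buffer + offset, sizeof value); // byte-wise copy, no alignment assumed
    return value;
}

// Copy a float into a raw byte buffer at any offset.
void write_float(unsigned char* buffer, std::size_t offset, float value) {
    std::memcpy(buffer + offset, &value, sizeof value);
}
```

Usage would be something like `float f = read_float(packet, 13);` in place of the cast. The buffer is only ever touched as bytes, so no misaligned float load is requested from the hardware.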

I chose the ARM processor because it is one of the many processors that won't handle float misalignment. The x86 family will succeed with a performance penalty, but it doesn't handle every alignment issue gracefully; other items like the 128-bit operand of an SSE instruction must be properly aligned or you'll crash.



Again, the compiler abstracts this away from you and it is something you don't need to deal with until you start intentionally moving values out of alignment. Any time you potentially move a value out of alignment you need to remember to return it to the compiler's expected alignment before using it again.
Thanks very much for your replies guys. I've got a clearer picture in my head now.

Going to go look into CPU Cache :).

Jai.
Hi, I just found this presentation yesterday explaining the behavior of CPUs and caches.

Bad visuals (improved somewhat if you go fullscreen), but the presenter is clearly spoken and informative.

http://skillsmatter.com/podcast/home/cpu-caches-and-why-you-care

[quote]If this is the case, then why would the following (Example 1) cause 2 reads, a shift and an 'OR' to produce the resulting data, yet in Example 2 the CPU can just access address '1' and take the required byte?[/quote]


This is a hardware issue you simply have to live with....

The address/data bus is hard-wired to read in blocks of 'N' bits, (99% of the time) aligned to memory addresses in multiples of 'N' (and 'N' changes depending on the CPU architecture). For a very, very simple example, take a look at how the Z80 does it (he does waffle on quite a bit - skip to 7:45). The Z80 can only read on 8-bit boundaries, and can only read 8 bits at a time (so you could not read on a 4-bit boundary, for example). There is no way around this due to the physical wiring that exists on your circuit board (aka motherboard).

Old 32-bit Intel CPUs extended this to blocks of 32 bits, on 32-bit boundaries.
Modern Intel CPUs read/write 128 bits on 128-bit boundaries.

There isn't anything you can do about it because it is literally hard-wired that way... (The programming language may support it, but the CPU does not.)
