Quote:Original post by okonomiyaki
Essentially, a CPU which requires a 4-byte boundary is assuming that 4 bytes will hold the largest primitive datum, which would be your standard integer? Then all pages/caches/etc. deal with memory in 4 byte chunks, and makes sure no datum is split across pages or memory boundaries.
Pages and caches don't really have anything to do with this; they have their own, separate alignment requirements. On x86 the page size is 4 KiB, and pages are always aligned on 4 KiB boundaries in physical memory. Cache lines are aligned on boundaries equal to the cache line size.
Some CPUs, such as ARM, do require that 32-bit values be aligned to 4-byte boundaries. On x86 an unaligned dword access simply results in multiple memory accesses; ARM, by contrast, cannot access unaligned 32-bit data directly at all, because there are no instructions for it. Of course you can always read 8 bits at a time and assemble a 32-bit value with shifts and ORs. For the same reason, on ARM it's faster to copy memory between buffers that share the same alignment relative to a 4-byte boundary. E.g. it takes more time to copy between buffers starting at addresses 0xBEEF0001 and 0xDEAD0002 than between 0xBEEF0001 and 0xDEAD0001 (probably the same is true of x86, though).
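The byte-at-a-time approach mentioned above can be sketched like this (the helper name is my own, and I'm assuming little-endian byte order; this issues only 8-bit loads, so it works regardless of the pointer's alignment):

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value from an
 * address that may not be 4-byte aligned. Only byte loads are
 * issued, so this is safe even on CPUs that fault or lack
 * instructions for unaligned 32-bit accesses. */
uint32_t read_u32_unaligned(const uint8_t *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The cost is obvious: four loads plus shifts and ORs instead of one aligned 32-bit load, which is exactly why mismatched buffer alignments slow down copies.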
Quote:
Does that mean using `long long' variables (meaning 8 byte integers) is dramatically slower than 4 byte integers (on 32-bit architectures)?
Yes, but not because of alignment issues.
A 32-bit CPU (x86 and ARM, at least) has no instructions that operate on a 64-bit integer as a single unit, so every 64-bit operation has to be compiled into a pair of 32-bit ones. As far as the CPU core is concerned, it wouldn't make any difference if the two 32-bit halves weren't even adjacent in memory. (Caching and paging are another matter.)
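To make the cost concrete, here's a sketch of what a compiler effectively emits for a 64-bit add on a 32-bit target: add the low words, then add the high words plus the carry out of the low-word add (on x86 this becomes an ADD followed by an ADC). The struct and function names are mine, purely for illustration:

```c
#include <stdint.h>

/* A 64-bit value represented as two 32-bit halves, the way a
 * 32-bit compiler sees a `long long`. */
typedef struct { uint32_t lo, hi; } u64_pair;

u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;
    /* If the low-word sum wrapped around, it is smaller than
     * either addend, which tells us a carry occurred. */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}
```

So the slowdown is simply "two (dependent) instructions instead of one" for every arithmetic step, not an alignment penalty.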
Quote:
16 bytes? Really?
I can't remember for sure and didn't bother checking Intel's docs, but I believe so.