Should I use a linked list for this (C++)?

Started by
30 comments, last by King Mir 10 years, 6 months ago

No, the size of the cache closest to the CPU is what is labelled the L1 cache size for the CPU. For example, on the computer I'm at right now the L1 data cache is 32 kilobytes in size. The size of the cache line determines how much memory is pulled into the cache when reading an address, but you can have multiple cache lines loaded at the same time. With a 64 byte cache line, if you read a single byte, the 63 bytes around it are brought into the cache at the same time. Since page sizes are multiples of the cache line size, you'll never have a cache line with data from more than one page at a time. What you will have are cache lines from multiple pages loaded at the same time. I have no idea what you mean by "volume around" and "program order", so I can't comment on that.
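For illustration, here's a minimal sketch of what that line granularity means in practice, assuming a 64-byte line (the function names and the hard-coded stride are just for the example; real code should query the line size, e.g. via std::hardware_destructive_interference_size in C++17):

```cpp
#include <cstddef>
#include <vector>

// Sequential reads: each 64-byte line fill serves 64 consecutive accesses,
// so the other 63 bytes pulled in alongside each byte actually get used.
long long sumEveryByte(const std::vector<unsigned char>& buf) {
    long long sum = 0;
    for (std::size_t i = 0; i < buf.size(); ++i)
        sum += buf[i];
    return sum;
}

// Strided reads: touching one byte per line forces a fresh line fill on
// nearly every access, so memory traffic per useful byte is ~64x higher.
long long sumEveryLine(const std::vector<unsigned char>& buf) {
    long long sum = 0;
    for (std::size_t i = 0; i < buf.size(); i += 64)  // assumed 64-byte line
        sum += buf[i];
    return sum;
}
```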

And yes, hard drive access times are still measured in milliseconds, not microseconds, and it's hard drive access speed that drives virtual memory performance. Even solid state drive latency is still typically quoted in fractions of a millisecond. Unlike hard drives, though, solid state drives will probably see their latency continue to improve in the future.


That simplistic model does not match modern out-of-order x86/x64 processors. Instruction fetch, decode, and branch prediction are done before micro-ops are scheduled for dispatch on one of several out-of-order dispatch ports.

That is why I spoke only about cycles. The same pitfalls apply to both direct and indirect memory access, so they are not worth mentioning.
Specifically, JohnnyCode stated that adding an offset to get an address is necessarily slower than not adding one, which is the only claim I refuted. I didn't even mention data-cache fetches etc., because they (and everything else you mentioned) apply whether or not offsets are present in the address operands.
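For illustration, a minimal sketch of the two forms in question, assuming typical x86-64 code generation (the commented assembly is indicative only; actual output depends on the compiler):

```cpp
// Both functions typically compile to a single load instruction on x86-64;
// the indexed version just uses a base+index*scale addressing mode instead
// of a bare [base] operand, so there is no separate "add the offset" step.
int loadDirect(const int* p) {
    return *p;        // e.g. mov eax, dword ptr [rdi]
}

int loadIndexed(const int* p, long i) {
    return p[i];      // e.g. mov eax, dword ptr [rdi + rsi*4]
}
```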


Those dispatch ports don't have a way of guessing what their input will be until it is assigned. So the trick of always precomputing a load offset doesn't work.

The load offset is never precomputed. It is computed after fetching and decoding, prior to the next cycle at which execution will take place. If no offset is provided, the load still waits until the next cycle to execute.


L. Spiro

The fact is, indexing an array can actually carry a 1 clock cycle penalty compared to dereferencing a bare pointer or a pointer plus a constant offset, if that offset is < 2048 (I think this is the offset from the memory segment). See Table 2-12 in the optimization reference manual referenced earlier.

The address is not computed prior to execution. It cannot be, because the input registers are not available then. It is therefore presumably computed on the first cycle of execution of the load.

Also, the fact that addressing memory is part of the instruction is irrelevant. Firstly, because by the execution stage an instruction can be split into micro-instructions that are scheduled out of order, and secondly because the size of the data being fetched also imposes a latency penalty, even though that's not in any way apparent in x86/x64 assembly.

For completeness, it's also worth repeating/clarifying that computing an offset can itself cause stalls, but compared with linked lists this tends to matter less, because fetching the next pointer from memory in a linked list introduces the same kind of stalls or worse. In fact, the data dependency in computing successive node addresses can prevent or hinder the CPU from working on multiple values in the container at the same time, unlike for an array.
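For illustration, a minimal sketch of that dependency difference (the types and names here are just for the example):

```cpp
#include <cstddef>

struct Node { int value; Node* next; };

// Linked list: the address of the next node is the *result* of the previous
// load, so the loads form a serial dependency chain that the out-of-order
// core cannot overlap, and the prefetcher cannot predict the addresses.
int sumList(const Node* n) {
    int sum = 0;
    for (; n != nullptr; n = n->next)
        sum += n->value;
    return sum;
}

// Array: the address of a[i + 1] is known without waiting for a[i] to
// arrive, so several loads can be in flight at once and the hardware
// prefetcher can stream the data ahead of the loop.
int sumArray(const int* a, std::size_t count) {
    int sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += a[i];
    return sum;
}
```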

