How to use small datatypes efficiently



If you can design your system so you are constantly reading a series of sequential memory addresses -- basically operating on densely packed linear arrays or a structure of arrays -- the CPU will see this pattern and prefetch all your data, giving a great processing experience.

If you are reading objects one at a time scattered across memory, or striding through memory in large steps as with an array of big structures, you can end up with a huge number of cache misses and stalls as data is loaded from hither and yon.
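To make those two layouts concrete, here's a minimal sketch (the struct and field names are invented for illustration) contrasting an array of structures with a structure of arrays -- the SoA version keeps the hot field densely packed, so a linear pass over it touches nothing but useful data:

#include <vector>

// Array of structures: each element drags mostly-unused payload into the cache.
struct ParticleAoS {
    float x, y, z;     // the 12 bytes we actually want in a position pass
    float misc[21];    // 84 bytes of other state we never touch here
};

// Structure of arrays: each field is its own dense, linear run of memory.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> misc;   // cold data kept out of the hot arrays
};

void advance_aos(std::vector<ParticleAoS>& ps, float dx) {
    for (auto& p : ps) p.x += dx;   // strides 96 bytes per element
}

void advance_soa(ParticlesSoA& ps, float dx) {
    for (float& x : ps.x) x += dx;  // strides 4 bytes, fully sequential
}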

I read somewhere that modern hardware prefetchers don't necessarily need adjacent addresses; they can deal with stride. So your array-of-structures example would be doable, if that's true.

-potential energy is easily made kinetic-

They'll still be prefetching cache-line sized chunks (probably 64B), so you'd want any gaps / strides to be a multiple of 64 bytes.

The prefetcher can do stride, yes, but linear access (both forward and backward) is dead simple for it. I'm not sure whether the prefetcher looks at the addresses of the elements or at which cache lines are loaded (it seems to me the latter would be less expensive, but the benefit might justify the higher cost of the former). If it looks at cache lines rather than addresses, certain stride sizes could make the pattern hard to predict: e.g. a stride of 1.5 cache lines (96 bytes) produces a pattern of read cache line A, read B, skip C, read D, read E, skip F... The prefetcher would still help, assuming it notices that it should try something even if imperfect, but it won't get the maximum benefit.
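If you're stuck with an array-of-structures layout and want the stride to line up with whole cache lines as described above, one option (a sketch, not a universal recommendation -- the padding costs memory) is to align each element to 64 bytes:

#include <cstdint>

// alignas(64) rounds sizeof(Entity) up to a multiple of 64, so consecutive
// elements always start on a cache-line boundary and no element straddles
// two lines; the stride is a whole number of cache lines.
struct alignas(64) Entity {
    float         position[3];
    float         velocity[3];
    std::uint32_t flags;
};

static_assert(sizeof(Entity) % 64 == 0, "stride is a whole number of cache lines");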

The other issue is that the prefetcher itself doesn't directly benefit throughput -- it only improves latency. If you have a 96 byte structure and you only touch 12-16 bytes (say, an object position), then you're loading one cache-line per object and reducing latency of that cache line being ready only helps so much. On the other hand, if you have tightly-packed elements accessed linearly, it often works out that the prefetcher entirely (or almost entirely) hides the latency behind the work you're doing on the previously-loaded elements. In fact, I've seen a graph (I believe it was in a talk by Herb Sutter) showing that, for linear (forward or reverse) accesses, the prefetcher behaves very predictably and significantly better than it does for other access patterns (with random access being the worst case and strided access in between) -- for linear accesses the prefetcher looks very much like an infinitely large last-level cache.
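One common way to avoid paying a whole cache line for those 12-16 useful bytes is a hot/cold split: keep the frequently-touched field in its own dense array and move everything else out of the way. A rough sketch (all names are illustrative):

#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };        // the 12 "hot" bytes

struct ObjectCold {                    // fields the position pass never reads
    char  name[32];
    float health, armor, cooldowns[8];
};

struct World {
    std::vector<Vec3>       positions; // dense: ~5 positions per 64-byte line
    std::vector<ObjectCold> cold;      // same index, touched only when needed
};

void integrate(World& w, const std::vector<Vec3>& vel, float dt) {
    for (std::size_t i = 0; i < w.positions.size(); ++i) {
        w.positions[i].x += vel[i].x * dt;
        w.positions[i].y += vel[i].y * dt;
        w.positions[i].z += vel[i].z * dt;
    }
}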

throw table_exception("(╯°□°)╯︵ ┻━┻");

They'll still be prefetching cache-line sized chunks (probably 64B), so you'd want any gaps / strides to be a multiple of 64 bytes.

This is true, although I wonder whether the prefetcher issues both fetches for unaligned accesses that straddle cache lines. It most probably does, but I wonder how much faster it turns out to be.


The other issue is that the prefetcher itself doesn't directly benefit throughput -- it only improves latency. If you have a 96 byte structure and you only touch 12-16 bytes (say, an object position), then you're loading one cache-line per object and reducing latency of that cache line being ready only helps so much.

Repeated latency savings amount to an increase in bandwidth, but yeah, you're right, using only portions of a cache line reduces bandwidth efficiency.

-potential energy is easily made kinetic-

Repeated latency savings amount to an increase in bandwidth,


I know what you're trying to say, and it's even mathematically true (throughput is composed of two terms, elements processed over time, where in terms of caches the elements are the size of the cache line and the time is derived from latency). That said, I find it more productive to think about throughput and latency as orthogonal most of the time.

They certainly are related, but you typically have much more control over what elements of a cache line are next to each other (useful elements) than you do over when those elements are lower in the cache (or how many times you need to look at the same/duplicate data). That is, you can't actually reduce the Time term except by using more of the elements in close proximity. Thus, latency only really matters for loading something new (not yet first-level cached).

throw table_exception("(╯°□°)╯︵ ┻━┻");

In the general case, aren't latency and throughput completely unrelated?
e.g. a factory might output 1 car per hour (24/7). The latency between raw materials going in one end and a completed car coming out the other is irrelevant -- it could be an hour or a week, but the throughput is still 1 car per hour.

However, if the assembly-line/pipeline stalls, then latency becomes an issue.
Say that when the factory closes, all cars must be finished, and that the latency of building a car is 3 hours.
At the start of each day, the pipeline is empty, so it takes 3 hours until the assembly line is up and running at full capacity (when the factory begins the first car of the day, the final workers on the assembly line are idle).
At the end of the day, you have to stop feeding in raw materials 3 hours before closing time, so the first workers on the assembly line become idle.
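To put numbers on that: with 3 hours of latency and 1 car/hour of throughput, a 24-hour day that starts empty and must drain before closing only feeds the line for 21 hours -- roughly 21 cars instead of 24. The latency costs nothing per car, but each complete stop-and-restart of the pipeline costs you about (latency x throughput) worth of cars.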

GPUs have massive amounts of latency (e.g. 50 million CPU cycles), but also massive throughput thanks to great pipelining.

Back to CPU prefetching (which can reduce the *perceived* latency of loads, by starting them earlier), if doing so avoids a CPU pipeline stall, then it will improve throughput of the CPU (otherwise it really does nothing).
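As an aside, the same idea exists in software form: GCC and Clang expose __builtin_prefetch, which lets you request a line a few iterations ahead of where you're currently working so the load is already in flight before the pipeline would stall on it. A sketch only -- for a plain linear scan like this the hardware prefetcher normally does the job for you already:

#include <cstddef>

// __builtin_prefetch is a GCC/Clang extension and is only a hint; the
// distance (8 elements here) is a made-up number you'd have to tune.
long long sum_with_prefetch(const int* data, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8]);  // read prefetch, default locality
        sum += data[i];
    }
    return sum;
}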


than you do over when those elements are lower in the cache (or how many times you need to look at the same/duplicate data). That is, you can't actually reduce the Time term except by using more of the elements in close proximity. Thus, latency only really matters for loading something new (not yet first-level cached).

What do you mean lower in the cache? Latency matters for things in L1 as well, since L1 latency is around, what, 7 cycles now? Out-of-order execution with large instruction windows and speculative execution can make accessing L1 almost like accessing a register (not exactly, but you see what I'm getting at).


In the general case, aren't latency and throughput completely unrelated?

What do you mean by the general case? By the way, don't forget there is critical-word-first latency and then there is burst transfer latency with regard to main memory access; I think the transfer latency is a single cycle for cache access. In the case of AoS or SoA data being accessed in a tight loop, if you reduce the latency to initiate a transfer you will have more transfers in a given period of time, i.e. bandwidth usage (throughput) is increased.

Oh, by the way, I was introduced to this document very recently; it might be useful to people here: "What Every Programmer Should Know About Memory."

https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

-potential energy is easily made kinetic-

I think we're getting into a fiddly bit of semantics and also possibly talking about different things that are becoming muddled -- I'm not sure latency of memory accesses has much to do with the way you used the assembly line analogy (that seems more apt for latency of CPU pipeline results), since the memory access would be more like the supply of parts to the line, rather than the resulting widget.

Anyway, now again for the benefit of the OP and viewers, since I'm sure you're aware of all this, Hodgman... If you assume for a moment that we can retire (process) elements as quickly as we receive them, then throughput is only affected by the time between getting new elements -- the latency of loads. But we can't actually reduce the true latency of the read (Intel and physics say so); we can only help the prefetcher be well-predicted so that we see less perceived latency to start with, and then we can make sure to get the most out of each cache line by reducing the amount of useless stuff in it so that the latency term is amortized across more elements.

Of course, you could be doing so much work per element that the processing time hides even trips all the way out to RAM and back, but most (non-pathological) workloads don't (otherwise CPU vendors wouldn't bother with caches and prefetchers).

Long story short, everyone who's not yet clued in thinks that CPU cycles are the only finite resource you have to worry about in a computer -- maybe some worry about the amount of RAM you have to work with too. But that's really not true any more; CPU cycles aren't even the *primary* finite resource you should worry about. The most-limited (and potentially most limiting) resource is really the number of times a cache line can be brought in from main memory. On a typical system (DDR3-1600, dual-channel) you have a potential bandwidth of 25.6GB/s, but worst-case access patterns (pure random, single-byte reads) reduce that to just 400MB/s of usable data. In other words, your modern CPU (let's say an Intel i3, i5, or i7, between 2.5 and 4 GHz, with between 2 and 4 cores) does between 50 and 200 billion instructions per second, and for comparison it can only touch RAM about 400 million times per second -- even a slow CPU has about 100 CPU cycles to spend on every cache line, and a fast CPU more like 400. You have two full orders of magnitude more CPU cycles than times you can touch RAM.
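Spelling out that arithmetic: 25.6 GB/s divided by 64 bytes per cache line is about 400 million cache-line fills per second, and if only one byte of each line is useful (pure random single-byte reads) that is the same 400 MB/s of usable data. Set 400 million line fills against the tens-to-hundreds of billions of instructions the CPU retires per second and you land on that rough budget of a hundred to a few hundred cycles of work per cache line touched.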

throw table_exception("(╯°□°)╯︵ ┻━┻");


What do you mean lower in the cache?

I just mean lower -- the ALU/FPU only takes from registers, and continuing upward the registers only take from L1, L1 only takes from L2, L2 from L3, and L3 from system RAM (assuming no L4, which is rare and typically a victim cache anyway). If the data you want was previously evicted from L1 but still exists in L2, then you pay L2 latency on the first access and L1 latency on subsequent accesses of that cache line. What I meant is that you don't achieve the smallest latency unless you consistently work out of L1. Since L1 is very small, competition for cache lines is very high.


Latency matters for things in L1 as well, since L1 latency is around, what, 7 cycles now?

Yes, something like that. But as I said, competition is very high and you're likely to find yourself evicted if you lollygag. Furthermore, if you aren't making efficient use of your cacheline, then you need to load another sooner, which evicts some other cache line whose process may not be done with it, causing it to reload from L2 next time around -- all of which adds to the competition problem. Keep in mind your own app may be running several threads on a physical core, not to mention all the OS threads and other apps.

Each level of the memory hierarchy has an associated latency, and these latencies never change -- only the level of the hierarchy in which your data is found determines its apparent latency. You don't have any control over which level it's found in, either; you only know that if you're reading it now it's in L1, and that it probably won't be there long. The only influence you can exert, then, is to be ready to use the cache line so that you can retire it as quickly as possible before the competition causes it to be evicted -- you want to do as much work as possible with every cache line, and that means the cache line needs to be filled with useful data.
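In loop terms, that's the difference between making several passes over the same array (so the later passes have to re-fetch lines that were evicted in between) and doing all the per-element work in one pass while the line is still resident. A toy sketch of the idea:

#include <vector>

struct Particle { float pos[3], vel[3], acc[3]; };

// Two passes: by the second loop, early elements have likely been evicted
// from L1 (maybe L2 too, if the array is big) and must be fetched again.
void update_two_passes(std::vector<Particle>& ps, float dt) {
    for (auto& p : ps)
        for (int i = 0; i < 3; ++i) p.vel[i] += p.acc[i] * dt;
    for (auto& p : ps)
        for (int i = 0; i < 3; ++i) p.pos[i] += p.vel[i] * dt;
}

// One pass: each cache line is pulled in once and fully used before moving on.
void update_one_pass(std::vector<Particle>& ps, float dt) {
    for (auto& p : ps) {
        for (int i = 0; i < 3; ++i) {
            p.vel[i] += p.acc[i] * dt;
            p.pos[i] += p.vel[i] * dt;
        }
    }
}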


Out-of-order execution with large instruction windows and speculative execution can make accessing L1 almost like accessing a register (not exactly, but you see what I'm getting at).

Yes, of course, closer is better. Fancy modern processors make L1 retrieval so fast that it's almost free -- it only takes a few independent instructions to mask it. But again, you've got to keep it in L1 for that to hold, and that's hard to do. Luckily, what's good for L3 cache is also good for L1 cache, so you don't have to think much about the differences (only really when you start talking about the size of your active data set, since obviously 256k of 'hot' data isn't going to fit in L1) -- for smaller working sets that fit inside L1 (or especially one-and-done cache-line accesses) it's well enough to just worry about what's good for L1 cacheability. Regardless of cache level, though, the principles of well-predicted prefetches and dense cache lines hold true.

throw table_exception("(╯°□°)╯︵ ┻━┻");


I just mean lower -- the ALU/FPU only takes from registers, and continuing upward the registers only take from L1, L1 only takes from L2, L2 from L3, and L3 from system RAM (assuming no L4, which is rare and typically a victim cache anyway). If the data you want was previously evicted from L1 but still exists in L2, then you pay L2 latency on the first access and L1 latency on subsequent accesses of that cache line. What I meant is that you don't achieve the smallest latency unless you consistently work out of L1. Since L1 is very small, competition for cache lines is very high.

Yeah, I really should have got that... I haven't used higher/lower in regard to the cache hierarchy in a while.

-potential energy is easily made kinetic-

