Number of arrays CPU can prefetch on

3 comments, last by Z01 11 years, 11 months ago
I'm thinking of converting an array-of-structures to a structure-of-arrays as an optimization in some SSE code. It's usually a good idea. I'm concerned, though, because the structure would be converted into about 22 different arrays. Is there a limit to the number of arrays that a CPU prefetcher will work on? (i.e. is there a limit to the number of memory access patterns the prefetcher can remember? Note that I'm *not* asking about the number of prefetches in flight.)

Obviously the thing to do is try it and measure the performance difference. The problem is that I figure it will take about 3-5ish days of work to change the code around and I'm wondering if there's some hard limit on the number of prefetch prediction patterns that might mean I'm following a dead path. I thought I read something once about such a limit but I can't find any info on it now.

Concretely (as an example), say I'm trying to parallelize 5x5 matrix inversion. Suppose I have an array of 1 million Matrix5x5 I want to invert. My question in this case is: would I prefer the Matrix5x5 data to be handed to me as a single array-of-structures or as a structure of 25 arrays for SSE processing? For example, perhaps the structure-of-arrays layout would let me eliminate a lot of SSE shuffles by processing 4 matrices at once.
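To make the two layouts concrete, here is a rough sketch. The names `Matrix5x5SoA`, `to_soa` and `traces` are made up for illustration, and the trace is only a stand-in for the real inversion kernel. It uses plain scalar loops rather than intrinsics; the point is that in the SoA form, element k of 4 consecutive matrices sits in 4 adjacent floats, which is the shape a single unshuffled SSE load wants.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kElems = 25;  // 5x5 matrix, row-major

// Array-of-structures: each matrix is 25 contiguous floats.
struct Matrix5x5 {
    float m[kElems];
};

// Structure-of-arrays: one array per matrix element (25 arrays total).
// e[k][i] is element k of matrix i.
struct Matrix5x5SoA {
    std::vector<float> e[kElems];
    explicit Matrix5x5SoA(std::size_t count) {
        for (auto& a : e) a.assign(count, 0.0f);
    }
    std::size_t size() const { return e[0].size(); }
};

// Hypothetical helper: transpose AoS input into the SoA layout.
Matrix5x5SoA to_soa(const std::vector<Matrix5x5>& aos) {
    Matrix5x5SoA soa(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i)
        for (std::size_t k = 0; k < kElems; ++k)
            soa.e[k][i] = aos[i].m[k];
    return soa;
}

// Stand-in kernel: trace of every matrix, 4 matrices per outer step.
// The inner lane loop reads up to 4 adjacent floats from each diagonal array.
std::vector<float> traces(const Matrix5x5SoA& s) {
    const std::size_t n = s.size();
    std::vector<float> out(n, 0.0f);
    for (std::size_t i = 0; i < n; i += 4) {
        const std::size_t lanes = (n - i < 4) ? (n - i) : 4;  // handle the tail
        for (std::size_t d = 0; d < 5; ++d) {
            const float* diag = s.e[d * 5 + d].data() + i;
            for (std::size_t l = 0; l < lanes; ++l) out[i + l] += diag[l];
        }
    }
    return out;
}
```

Note that this touches 5 arrays per step for the trace; a full inversion would walk all 25 at once, which is exactly the access pattern the question is about.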
I would download the Intel Optimization Manuals from Intel's website. There is a lot of information, but Chapter 7 (Optimizing Cache Usage) should have most of your answers.

http://www.intel.com...er-manuals.html

x64/x86 CPUs have extremely sophisticated hardware prefetching, so generally you shouldn't need to explicitly prefetch data in your code. The first iteration of something can be an exception, since code can frequently 'surprise' the hardware prefetcher; to hide that miss you would need to issue your own prefetch much farther in advance in the code-base. This is frequently not very practical, and you have to eat the first L3 miss.
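For what it's worth, warming the head of a buffer by hand might look roughly like this. It is only a sketch: `warm_first_lines` and `sum` are hypothetical names, the 64-byte cache line is an assumption, and `__builtin_prefetch` is a GCC/Clang builtin (on MSVC you would reach for `_mm_prefetch` instead).

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Request the first `lines` cache lines of `data` well before they are
// needed. A prefetch is only a hint; the CPU is free to ignore it.
void warm_first_lines(const float* data, std::size_t count, std::size_t lines) {
    const char* p = reinterpret_cast<const char*>(data);
    const std::size_t bytes = count * sizeof(float);
    for (std::size_t l = 0; l < lines && l * kCacheLine < bytes; ++l)
        __builtin_prefetch(p + l * kCacheLine, 0 /* read */, 3 /* high locality */);
}

// The later streaming pass. Once it is running, the hardware prefetcher
// is expected to pick up the sequential pattern by itself.
float sum(const float* data, std::size_t count) {
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) total += data[i];
    return total;
}
```

The idea would be to call `warm_first_lines` as early as you know the pointer, do unrelated work, and only then call `sum`. Whether there is enough unrelated work to hide the miss is the impractical part.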
http://www.gearboxsoftware.com/
I had a look through the Intel optimization manual a few days ago, but I couldn't find anything specific. You're right, though, maybe I should try the architecture manual instead; it might discuss such things more explicitly. And then there are the AMD manuals, which might have something. The fact that the Intel manual doesn't mention anything about the number of arrays/streams (that I could find) makes me think it's a non-issue (or a trade secret).
From here, 2.1.5.4 (pdf).

Up to 32 streams, but only one forward or backward stream per memory page; it also depends on the number of requests and many other factors. So if all your data is located close together and iterated in the same direction, you get one stream.

But it's not something you should optimize for. It's painfully model/microcode specific; a revision of the CPU may change it.
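As a back-of-the-envelope check against those numbers, you can count how many distinct pages one step of the loop touches in each layout. This is a simplified sketch that assumes 4 KiB pages, float data, arrays laid out back to back from a page-aligned base, and "one page = one stream", which is a crude reading of the rule above rather than how any particular prefetcher actually works. The function names are made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::uint64_t kPageBytes = 4096;  // assumed page size

// SoA: distinct pages touched when reading element i from each of
// `arrays` arrays of `count` floats stored back to back.
std::size_t soa_pages_per_step(std::size_t arrays, std::size_t count, std::size_t i) {
    std::set<std::uint64_t> pages;
    for (std::size_t a = 0; a < arrays; ++a) {
        const std::uint64_t byte =
            (static_cast<std::uint64_t>(a) * count + i) * sizeof(float);
        pages.insert(byte / kPageBytes);
    }
    return pages.size();
}

// AoS: pages spanned by structure i, which holds `fields` contiguous floats.
std::size_t aos_pages_per_step(std::size_t fields, std::size_t i) {
    const std::uint64_t first = static_cast<std::uint64_t>(i) * fields * sizeof(float);
    const std::uint64_t last = first + fields * sizeof(float) - 1;
    return static_cast<std::size_t>(last / kPageBytes - first / kPageBytes + 1);
}
```

With 25 arrays of 1 million floats each, every array lands in its own page, so the SoA loop walks 25 pages in parallel, under the 32 quoted above. The AoS loop touches one page per matrix, or two when a matrix straddles a page boundary.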

"structure of 25 arrays for SSE processing?"

Loads/stores are by far the most expensive part, so minimize them.

My guess would be that two arrays, one for input, one for output would work best.
Thanks. I've also been looking at the structures-of-structures layout as an alternative to AoS or SoA, maybe I'll fiddle with that.
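For reference, this is roughly what I mean, assuming "structures-of-structures" is the hybrid layout often called AoSoA: one array of small blocks, each holding 4 matrices interleaved lane by lane. `Matrix5x5x4` and `pack` are made-up names and this is only a sketch of the layout, not of the inversion itself.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;   // floats per SSE register
constexpr std::size_t kElems = 25;  // 5x5 matrix, row-major

// One block = 4 matrices. e[k][l] is element k of the l-th matrix in the
// block, so e[k] is 4 adjacent floats that can be loaded without shuffles.
struct Matrix5x5x4 {
    alignas(16) float e[kElems][kLanes];
};

// Pack `count` row-major matrices (25 floats each) into blocks of 4.
// Unused lanes in the last block stay zero.
std::vector<Matrix5x5x4> pack(const float* aos, std::size_t count) {
    std::vector<Matrix5x5x4> blocks((count + kLanes - 1) / kLanes);
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t k = 0; k < kElems; ++k)
            blocks[i / kLanes].e[k][i % kLanes] = aos[i * kElems + k];
    return blocks;
}
```

Each block is 400 contiguous bytes, so the whole input is read front to back as a single sequential run, while the inner loop still gets the 4-wide lanes that SoA would give.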
