As HappyCoder mentioned, this is almost entirely automatic. The CPU is able to detect sequential access patterns. Even if you are pulling from two or four or more data streams, as long as you are traversing them sequentially the CPU can almost always detect it and prefetch your data.
A couple of days ago there was a topic on this. I was asking mainly about two cases of prefetching, which can be expressed in a simple example like this:
for (int i = 0; i < N; i++)
    A[i*4] = B[i*3];
1) many interleaved array accesses to prefetch: will this work?
2) prefetching by stride: will this work?
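To make the two questions concrete, here is a minimal sketch in C that separates them into one function each, plus the loop from above as a function. The function names (sum_three_streams, sum_strided, strided_copy) are made up for illustration. Nothing here calls the prefetcher explicitly; these are just the access patterns a hardware prefetcher would have to recognize, and whether it does depends on the CPU, so it needs to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Case 1: several arrays walked sequentially in the same loop.
   Each array is its own forward stream of addresses. */
long sum_three_streams(const int *a, const int *b, const int *c, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] + b[i] + c[i];
    return sum;
}

/* Case 2: one array walked with a constant stride (counted in elements).
   Visits indices 0, stride, 2*stride, ... that are below n. */
long sum_strided(const int *a, size_t n, size_t stride)
{
    assert(stride > 0); /* a zero stride would never terminate */
    long sum = 0;
    for (size_t i = 0; i < n; i += stride)
        sum += a[i];
    return sum;
}

/* The loop from the post: a stride-3 read stream and a stride-4 write
   stream at the same time. dst must hold at least 4*(count-1)+1 ints
   and src at least 3*(count-1)+1 ints. */
void strided_copy(int *dst, const int *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i * 4] = src[i * 3];
}
```

Timing sum_strided over a buffer much larger than the cache, for a range of stride values, would be one way to see where the per-access cost starts to climb on a given machine.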
I found some hints on the net which say:
1) It seems that many interleaved streams can be prefetched (I do not know the limit, but I suspect there is one).
2) Prefetchers can work by stride.
- Probably they really do it in a strided way, that is, they fetch some bytes, skip a gap, fetch some bytes, skip a gap, rather than pulling in all the contiguous memory (strided access is what the suggestions I read described). But if the cache line is 64 bytes, it would seem that each fetch brings in at least 64 bytes. I don't know.
[If so, thinking about the examples in that previous topic leads to this: when scanning the full table of used and unused records, the prefetcher will prefetch all of it. That means more traffic through the cache (used and unused records) but no cache misses. In the second example, reaching only for the used records, there would be much less cache traffic, but I think the prefetcher could not work there. If so, I suspect the first example can be better.]
- They cannot work when the stride is big, about 1000 or so, and in such a case there would be a stall.
That's what I found. In my opinion it is important knowledge to consider if someone is interested in optimization.
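For the record-table case in the bracketed note, the two variants could look roughly like the sketch below. The Record layout and the function names are assumptions made up for this example, not code from the earlier topic. The first variant touches every slot in address order; the second touches only used slots, so its access pattern is only as regular as the positions of the used records. Which one is faster depends on how sparse the table is and on the CPU, so this is something to benchmark rather than assume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layout for illustration. */
typedef struct {
    int used;   /* nonzero if the slot holds live data */
    int value;
} Record;

/* Variant 1: scan every slot in order and skip the unused ones.
   More bytes pass through the cache, but addresses are sequential. */
long sum_full_scan(const Record *table, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (table[i].used)
            sum += table[i].value;
    return sum;
}

/* Variant 2: visit only the used slots through a separate index list.
   Fewer bytes are touched, but the gaps between accesses are irregular. */
long sum_by_index(const Record *table, size_t n,
                  const size_t *used_idx, size_t n_used)
{
    long sum = 0;
    for (size_t k = 0; k < n_used; k++) {
        assert(used_idx[k] < n); /* index must point inside the table */
        sum += table[used_idx[k]].value;
    }
    return sum;
}
```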