Archived

This topic is now archived and is closed to further replies.

which way is the most cache friendly?

This topic is 5121 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Recommended Posts

This has always bothered me. Which of the following is the most efficient?  Assume that the array was created like so:
someobject** somearray = new someobject*[width];
for(int x=0;x < width;x++) {
somearray = new someobject[height];
}     
A)
for(int x=0;x < width;x++) {
for(int y=0;y < height;y++) {
...somearray[x][y]...
}
}     
B)
for(int x=0;x < width;x++) {
for(int y=0;y < height;y++) {
...somearray[y][x]...
}
}     
C)
for(int y=0;y < height;y++) {
for(int x=0;x < width;x++) {
...somearray[x][y]...
}
}     
D)
for(int y=0;y < height;y++) {
for(int x=0;x < width;x++) {
...somearray[y][x]...
}
}     
[edited by - Raloth on February 8, 2004 6:25:01 PM]

Share on other sites
I would think that all 4 would have O(n^2) behaviour.

Share on other sites
Unless you take into consideration width and height not being equal (and as long as you do it right that won''t be a cache issue, it will be a branch prediction issue) C and D are the same as A and B.

The reason one would be better then the other in terms of caching is because one is sequentially moving through memory (so that you can process an entire chunk of memory before requesting another) and one is making random accesses, which if your structure is large enough will require a new chunk to be brought into cache for every single iteration (which would be very slow). For this A seems to be the right one, since in memory you have A[0][0];A[0][1];A[0][2];A[1][0];A[1][1];A[1][2];A[2][0];A2][1];A[2][2];.

Share on other sites
porthios: not when a cache is involved

Right. B and C are dog slow, because you''re jumping around in memory. D transposes your matrix. A is the one, but only because of the non-canonical way you set up the matrix.
The ''correct'' way is:
int m[h][w];
for each y: for each x: m[y][x] = 0
You want to scan sequentially in memory in the inner loop.

Share on other sites
So...A and D are equally good, right?

Share on other sites
Only if width = height. The matrix is declared as [width][height], and D accesses it backwards.

Share on other sites
quote:
Original post by Jan Wassenberg
porthios: not when a cache is involved
Nah, it still boils down to O(n^2) (or O(w*h) more accurately), becuse incorrect lookup may only make it O(n^2 + n) where the latter n could have a high constant, but it''s still same as O(n^2)

Share on other sites
quote:
Original post by Anonymous Poster
O(n^2 + n)
Oops, more like O(n^2) for incorrect look-up (n^2 times normal mem access) and O(n^2+n) for good look-up (n times slow access, roughly n^2 times cache-friendly access)

Share on other sites
quote:
Original post by Anonymous Poster
Oops, more like O(n^2) for incorrect look-up (n^2 times normal mem access) and O(n^2+n) for good look-up (n times slow access, roughly n^2 times cache-friendly access)

i just have to mention here, why i dont like that academic "if it has roughly the same number of trailing zeroes for huge n it is equal". or maybe one should insist on the word _complexity_ which doesnt translate to _speed_. use that O-notation stuff to compare complexities of algorithms, not as a weird way to say "cache and parallel execution doesnt matter, its still equally fast".

in the end w*h*cacheaccess and w*h*cacheaccess+ w*h*overhead obviously isnt "the same", no matter what some "rough estimation notation" might say.

Share on other sites
quote:
This has always bothered me. Which of the following is the most efficient?

object* objectArray = new object[width * height];

Allocate a single contiguous array helps [hardware, especially] prefetchers do the right thing and allows you to walk the entire array with a pointer increment (strength reduction is your friend).

Whether you want the element at row r, column c to be at objectArray[r + (c * width)] or objectArray[(r * height) + c] will depend as much on what you want to do as anything else (i.e. row-major or column-major may be optimal, depending on the operations being performed -- for matrix premultiplication, for example, it would be preferred to have one matrix row-major and one column-major).

You might want also to investigate a cache agnostic configuration, though that will be more complicated.

Share on other sites
On a different note - while loops always used to be faster than for loops too. Not sure if the VC++ compiler has optimized this out of existence yet...

Share on other sites
It is blatantly obvious that the problem of accessing every field in an NxN matrix, given random access memory, has a complexity of O(f), where f(n) = n^2. However, the ''random access machine'' assumption underlying algorithm analysis is not strictly true if we have cache on top of the memory. With worst case access patterns, each access will take ~k operations, where k is the cache line size (assuming a fully associate cache of at most N-1 lines, with LRU replacement strategy). If k approaches N, the total number of operations needed to access all of the elements is roughly N^3. Therefore, your statements are incorrect in general.
Now comes the argument that k is typically smaller than N. To this I reply: the overwhelming majority of matrices used in this field are 4x4 anyway. In fact, justify your use of asymptotic notation - that''s pretty hard, unless you''re doing heavy scientific number crunching.

> _complexity_ which doesnt translate to _speed_
Correct. These are distinct. In fact, the above posters sound like they just learned O-notation

Share on other sites
quote:
Original post by Jan Wassenberg
To this I reply: the overwhelming majority of matrices used in this field are 4x4 anyway. In fact, justify your use of asymptotic notation - that''s pretty hard, unless you''re doing heavy scientific number crunching.
I think porthios'' comment was meant as a half joke, such as "constant factor doesn''t matter". At least I took it as a joke, because when one asks how to optimize cache efficiency, they do care about constant factor. But even so, if you claim that O(n^2) doesn''t hold when cache is taken into account, you simply have a wrong interpretation of the O-notation. I.e. If all the matrices you ever need are 4x4, n=4, then O-notation with a parameter n is useless (duh) but you still can''t make an assertion that O(n^2) has failed to hold. It''s a notation for *asymptotic* behaviour and with a constant n or n in a small range, we have no asymptocism to measure.

But I can justify the notation outside heavy scientific number crunching as well: for image processing, images sizes easily range from 10x10 to 5000x5000, with no upper limit. There O-notation is useful, though trivial.
quote:
Correct. These are distinct. In fact, the above posters sound like they just learned O-notation
But you sound like you never even learned it

Share on other sites
quote:
Original post by Trienco
use that O-notation stuff to compare complexities of algorithms, not as a weird way to say "cache and parallel execution doesnt matter, its still equally fast".
Hey, I *did* use it to compare the complexities of algorithms, and not for the thing you put in quotes! The two algorithms just did nothing but looped through the array, and since both used roughly equal amount of O(1) lookups, there was no difference in algorithmic complexity. Since Jan claimed that O(n^2) doesn''t hold for such a simple algorithm due to cache, don''t you think that needed straightening out?
quote:
in the end w*h*cacheaccess and w*h*cacheaccess+ w*h*overhead obviously isnt "the same"
Indeed, that''s trivial. But we were talking about O-stuff which can''t be used for speed comparisons, only for algorithmic complexity. So there shouldn''t have been any confusion..

Share on other sites
quote:
Original post by Anonymous Poster
The two algorithms
Just to avoid any confusion: Normally the two ways to loop through an array, where only order of indices differ, wouldn''t count as two different algorithms. But what counts and what doesn''t only depends on the level of detail we''re working on. Since we were interested in if cache affects the algorithmic complexity (due to O-notation used), it was needed to work on such a low level that handles cache and normal memory lookup as separate things.

Share on other sites
At school last term when I learned about this issue with memory access and the effects it has on cache and paging systems I decided to write a small test program that accessed a large 2d array in two different orders. One was sequentially along the rows first then columns [which is the best way] and one the other way, columns first then rows.

The results were horrifying. Just as they said in class that you''d get lots of page misses if you accessed out of sequence the column then row method took significantly longer than the other.

I also did the test in C++ Java and C#. On my computer Java and C# were faster than C++ (everything in default release modes.) When I get home tonight I''ll come back on post the source for each so you guys can try it out. Or you can write up the tests yourself as an exercise

-out

Share on other sites
First, why are you hiding behind AP? It''d be nice to know who I''m talking to, and registering isn''t that hard. Don''t stand behind what you are staying?
Now, in random order:

quote:
But we were talking about O-stuff which can''t be used for speed comparisons, only for algorithmic complexity. So there shouldn''t have been any confusion..
"O-stuff" can be used for whatever you please. Big-O is simply notation that represents a set of functions. You can make a statement about any function (be it size of input -> speed) by saying it''s in O(another function).

quote:
I.e. If all the matrices you ever need are 4x4, n=4, then O-notation with a parameter n is useless

Thank you, that''s what I said.

quote:
It''s a notation for *asymptotic* behaviour and with a constant n or n in a small range, we have no asymptocism to measure.
Duh.

quote:
But I can justify the notation outside heavy scientific number crunching as well: for image processing, images sizes easily range from 10x10 to 5000x5000, with no upper limit.
OK, that''s another example. What I wanted to point out is that O-calculus is useless unless your matrix dimensions go into the hundreds or thousands (asymptotic behavior). We agree on this though.

> There O-notation is useful, though trivial.
huh?!

quote:
Since Jan claimed that O(n^2) doesn''t hold for such a simple algorithm due to cache, don''t you think that needed straightening out?
Yes, please straighten it out. I find nothing wrong with what I said. I suspect you didn''t understand - go back and read it again.
Or I''ll give it another shot - complexity of the task at hand (assuming random access machine!) is N^2, that is patently obvious. Of course you can''t do it in less. However, given a non-random access machine (definition: lookup doesn''t take 1 "operation" unit; extreme example: Turing machine; in our case, a cache), it is easy to find tasks that take much longer (even asymptotically, e.g. N^3 instead of N^2) than the inherent complexity of the problem. Got it now?

quote:
since both used roughly equal amount of O(1) lookups, there was no difference in algorithmic complexity.
This is the flaw in your reasoning. Lookups aren''t necessarily O(1) on non-random access machines. Consider a Turing machine. One algorithm for a matrix lookup might be to keep a base-1 row number on the band, advance to the next end-of-row marker, delete it, scan back to the number, decrement it, and so forth. That''s obviously not constant-time.

wow, I wasted 20 minutes on this post. *sigh*

Share on other sites
quote:
Original post by Jan Wassenberg
"O-stuff" can be used for whatever you please.
It can''t be used for speed comparisons in the way Trienco assumed I had meant it. I can''t deduce that since O(2n) = O(n) that I can do double the work in the time that would be used for one shot. If I could, I could do all the work ever in one second. Maybe I should just specify that some uses of "O-stuff" are plain wrong. I can use toilet paper to make a bridge too but it should be obvious that nobody would.
quote:
Thank you, that''s what I said.
You also said a lot of other things, some (maybe just one) of them were bull and some weren''t. I needed to clear out which weren''t to make a consistent point.
quote:
Duh.
Sorry for lecturing, but I thought you needed it.
quote:
> There O-notation is useful, though trivial.
huh?!
I mean that since processing must be done for each pixel separately, it''s trivial that if you double the amount of pixels you''ll be doing double the work. Non-trivial, useful uses for O-notation would be e.g. figuring out the complexity of a set of sorting algos and using the notation to compare them.
quote:
Yes, please straighten it out. I find nothing wrong with what I said. I suspect you didn''t understand - go back and read it again.
No misunderstanding here at all. See below.
quote:
However, given a non-random access machine (definition: lookup doesn''t take 1 "operation" unit; extreme example: Turing machine; in our case, a cache)
Ah, this is where the error lies. Caches may not have exactly determined 1 "operation" unit behaviour, but their look-up times are bounded (both from up and down) because the caches are not infinite. We can thus define the 1 "operation" unit to be the longest possible look-up time from the cache and notice that caches still have O(1) asymptotic behaviour.

So any currently used cache implementation couldn''t possible make the "speed complexity" (for lack of a better term) to rise to O(n^3) from O(n^2) like you described to be possible. For imaginary caches that are unlimited in size or as big as conventional memory, and linearly accessed, see below:
quote:
This is the flaw in your reasoning.
Coincidentally, it also points out a flaw in your reasoning
quote:
Lookups aren''t necessarily O(1) on non-random access machines.
I agree. However, even in such case your original statement "not when a cache is involved" would be wrong. The statement agrees that the behaviour is O(n^2) when caches aren''t involved but that it is something else when caches are involved. But if you take the turing machine example, the algorithm can take O(n^3) time also *without* a cache. So grasping for straws didn''t work here quite well. It just showed that we''re both wrong when examples get anal (and unrealistic) enough.

Maybe it''s time to end this hairsplitting contest since I doubt we disagree much except maybe some syntax, but I''m still eager to see your reply . I already spotted a place where you could split hairs to a nanometric degree but maybe you won''t do it.

Share on other sites
quote:
Original post by Jan Wassenberg
> There O-notation is useful, though trivial.
huh?!
Haha, I wrote, and then read that statement pretty wrongly! Never mind what I said above wrt this. What I should''ve said there is that in image processing, O-notation *can* be used because the amount of pixels can keep on growing without a limit thus ''adapting'' to the asymptotic behaviour (not exactly adapting since it''s its exact complexity as well), but that in such a case the O-notation would be trivial and it''s usage wouldn''t be very fruitful. "useful" => "could be used"

Share on other sites
Alright, let''s nitpick. The original statements were ''all 4 methods are O(n^2) time'' and ''not if a cache is involved''. Than means ''cache involved => statement is false in general''. To show that, I only have to point out one silly case where the underlying machine changes the number of steps required.

The Turing machine is one, and an ''infinite'' cache is another. By this I mean the lines are about as big as a matrix row (see above post). Since we''re talking about actual machine memory, N remains finite (sorry, no standard big-O let ''em go to infinity here), so you can''t complain the cache isn''t infinite.

What I''m saying here: a cache won''t change this problem''s complexity, but it can change the asymptotic function of operations required (''speed'') as in the Turing example, since realistically, the cache line size K is some significant portion of the matrix row size.
It is not justified to neglect it as a constant factor, except in rare uses where the matrix dimensions go into the thousands. In theory, you are correct about (n -> inf) => cache is a constant factor, but that''s entirely worthless in practice (lacking infinite mem), and therefore the counterexample stands.

Yes, you''re right. This is splitting hairs, which I really don''t have time for. Was fun though

Share on other sites
If yes, you can allocate a "continuous" array (cache friendly) by doing that:

objet **array;array = new objet*[sizey];array[0] = new object[sizex*sizey]; //note that you have only two 'new'.for(f=1,k=sizex;f < sizey ;f++,k+=sizex){array[f] = array[0][k];}

and you can access linearly like that array[0][indice].

Chuck.

[edited by - Chuck3d on February 10, 2004 3:06:41 PM]

Share on other sites
I''ll take one more shot since I have it very clear now what made our views different.
quote:
Original post by Jan Wassenberg
In theory, you are correct about (n -> inf) => cache is a constant factor, but that''s entirely worthless in practice (lacking infinite mem)
Not at all worthless! The cache is still a lot smaller than actual memory, and in applications where the matrices don''t fit in the cache, the speed of a cache-lookup is essentially O(1), and indeed *in practice*. No need for infinite memory to notice this. Processing an image or a heightmap with the size of a few megabytes, for example, would be enough to hit the cache''s upper lookup time and so making it a constant factor. This holds to what you said earlier as well. We can''t hit infinity with any real memory, but we''ll reach the asymptotic behaviour soon enough, as caches are so small when compared to conventional memory. This is what matters IMO. (and this is not nitpicking!)

But yeah, I see how it could be useful to have say the algorithm may behave like O(n^3) if the n is small enough. The notation is still helpful since it says that e.g. when n<10, the ''real'' complexity function may be more than than C*n^3 but as n rises to 10<n<1000, we get the asymptotic behaviour of O(n^3). And once we break the n=1000 barrier and upwards, cache gets small enough and a better approximation can be got by using O(n^2). Maybe I was too eager to think that the n should be thought to be very big (but within realistic bounds of conventional memory), when O(n^2) in the original case would hold with cache or not. But if we want better approximations for *small* n, then we may indeed get O(n^3) when n<1000.

This is precisely where our differences arose. Maybe you only deal with small matrices and arrays, so in your field (or what you say "in practice" the estimate O(n^3) was correct (best approximation as long as n<1000). But for what I had assumed n to be, e.g. a bitmap image, O(n^2) was the correct limit (best approximation as long as 1000<n<100000).

Share on other sites
quote:
Maybe you only deal with small matrices and arrays, so in your field (or what you say "in practice" the estimate O(n^3) was correct (best approximation as long as n<1000). But for what I had assumed n to be, e.g. a bitmap image, O(n^2) was the correct limit (best approximation as long as 1000

Alright, good summary, we''re on the same page here.
Yeah, it''s always funny to see the assumptions we make about certain things we ''always''/''never'' use
BTW, who''s the man behind the AP mask?