std::vector, L2 cache

Started by
36 comments, last by thesimon 18 years, 3 months ago
hi there! this time my question revolves around std::vector<> and the L2 cache... or L2 hit misses. (i'm using intel c++ 9) i got an algorithm working that roughly speaking behaves something like (original algorithm is really different.. but the most computational intensive parts of it behave like this):

struct thestruct
{
   float a,b,c;
   std::vector<float> d;
};


std::vector<thestruct>  vec(.....);
randomize all members of each thestruct in vec
sort vec based on some criteria (doesn't matter for problem statement)
pick a (random) std::vector<float> temp, do some computation with each thestruct::d in vec... 
imagine like e.g. a dot product of temp and each thestruct::d in vec.
the the size of each thestruct::d is always around 5 to 20. now. thing is that with std::vector<> this is very L2 cache bound a problem (yep i measured, yep in release ;)) for small sizes of vec (below around 100) everything runs nicely. however when i set the size of vec to something bigger it gets worse and worse. i suspect that this is due to the fact that each instance of thestruct::d is not in linear order with the rest of thestruct.. sort makes this even worse as it destroys the linearity in memory completely. so even just iterating through all thestructs and accessing each thestruct::d produces lots of L2 cache misses (actually about 70 - 80 % with my configuration). also sort makes some kind of clever software prefetching impossible, i think. according to the profile L2 cache misses are the biggest problem with my algorithm by far. prior to designing everything with std::vector i had an implementation that did all the memory maangement itself.. and always ensured a linear memory layout of vec.

| a,b,c (d1, d2, d3, .... dn) |  a,b,c, (d1, d2, d3, ... dn) .....
however this of course required a lot of hand coded re-ordering of memory (especially after something like std::sort...) so i dropped that solution, as it affected the clarity of my algorithm's implementation a lot. so, is std::vector<thestruct> and L2 cache friendliness like fire and water? thanks a lot for any advice! i'm a bit near to my skill limits with this one.. best regards, simon.
Advertisement
I think you've misunderstood how vectors work. A vector's elements are guaranteed to contiguous in memory. Randomly accessing an element is a constant time operation, no different from accessing a standard array. Iterating through a vector is no different from iterating through a standard array.

Vectors become expensive if you're resizing the vector or deleting or reordering elements. Internally they'll be using something like array doubling to reallocate, which will obviously be a lot slower as the vector size grows. If the overhead for these operations is too high then you might want to consider another data structure, like a linked list.

____________________________________________________________www.elf-stone.com | Automated GL Extension Loading: GLee 5.00 for Win32 and Linux

std::vectors of std::vector (which is basically what you are doing) are not a particullarly good idea, as the memory is not contigous as it would be in a multi-dimensional C-array.
Perhaps it would be best to restructure your data?

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Quote:Original post by thesimon
struct thestruct{   float a,b,c;   std::vector<float> d;};


Don't do that. Use d[0],d[1] and d[2] as a,b,c and scrap the struct. d[3] is your d0 then. You can easily randomize or sort parts of vectors.
The problem is actually probably std::vector's poor copy semantics which are visible as L2 cache problems rather than inherent data access problems that are caused by the indirection in std::vector's data storage mechanism. What happens when std::sort() does it works is that the assignment operator is called repeatedly for the data type, which for your type involves copying the vector data around. This isn't so efficient. You may want to consider using a different data type with less expensive copy semantics. For example, boost::shared_array or boost::shared_ptr<std::vector>.
Doesn't sort() use swap<>() which is efficient on vectors?
thanks for your answers!

benjamin bunny, i know that std::vector and standard arrays behave in a similar way. i wouldn't have these l2 problems if i didn't have a vector of vectors as swiftcoder pointed out. reading each thestruct::d in a sorted vector<thestruct> results in a very non-linear acccess pattern.

(thinking loudly here...)
so i have to restructure my data and maybe make one struct containing a,b,c and another struct containing d1,..dn and then have two vectors. one of the first struct and one of the second. if i sort the first vector i lose the correspondence of the two vectors. so i need something like a pointer in the first struct (a,b,c,pointer) into the second struct.
struct s1 { float a,b,c; pointer p; }struct s2 { std::vector<float> d; }vector<s1> vec1(N);vector<s2> vec2(N);


iterating over the first struct i always have a pointer indirection into the second struct when i want to access the d elements. this indirection is again not linear and cache unfriendly... another possibility would be to reorder the second vector based on the new sorted order of the first one...... this sucks.
having two vectors containing data of what is actually a single struct is also very unintuitive...

i could also precompute the dot product for each element of the second vector and then while iterating over the first sorted vector look up the result of the dot product in a table based on pointer p (which would be small enough to more likely fit into L2....) this way i would only access vec1 and vec2 in a linear predictable way... the only thing non-linear would be the lookup of the precomputed dot-product.. which should be L2 friendly.

is this the way to go ? i know my desciption here is quite confusing.. i tried my best.....
haven't already thought through how this will affect my implementation... but breaking down the structs intwo two smaller structs (thus avoiding std::vector<std::vector<> > seems desirable....)

does this sound reasonable (i'll go ahead and try implementing a benchmark... trying to judge by the numbers).

thanks again!
simon.
Quote:Original post by Trap
Doesn't sort() use swap<>() which is efficient on vectors?


Not specified by the standard. And even it was you'd need to specialize swap() for thestruct since even if std::sort() did use swap(), swap() on thestruct would use assignment anyways.
I would just use:
vector<vector<float> > vec;

And store a,b and c at the start of every vec, that way you have 1 indirection less.
AAh. thanks for the other two replies! they came in while i was typing the above post.

SiCrane, i remember from my previous attempt at this (doing memory layout myself) that it was by far faster then the vector version.. this could very well be because of the fact that my custom layout when passed to sort() would result in a lot less copies (the second (d) vector wouldn't be copied around then.. so my previous attempt was essentially the same as using boost::shared_ptr<std::vector<> > as you suggested... only the a,b,c members and the pointer got copied around by sort() the d vector was never touched (reducing the amount of data moved around to 10 - 20 %).

i will try your boost::shared_ptr<> suggestion first before i go ahead and split the struct into two, as the impact on my code of your suggestion is non-existent :) which i like :)

Trap, your solution would suffer from the same 'poor copy semantics' when feeding it into sort.. if SiCrane is correct, this is (a big) part of the problem...

i'll give it a try!

regards,
simon

This topic is closed to new replies.

Advertisement