std::vector, L2 cache


hi there! this time my question revolves around std::vector<> and the L2 cache... or L2 cache misses. (i'm using intel c++ 9) i got an algorithm working that roughly speaking behaves something like this (the original algorithm is really different.. but the most computationally intensive parts of it behave like this):
struct thestruct
{
   float a,b,c;
   std::vector<float> d;
};


std::vector<thestruct>  vec(.....);
randomize all members of each thestruct in vec
sort vec based on some criteria (doesn't matter for problem statement)
pick a (random) std::vector<float> temp, do some computation with each thestruct::d in vec... 
imagine like e.g. a dot product of temp and each thestruct::d in vec.
the size of each thestruct::d is always around 5 to 20. now, the thing is that with std::vector<> this is a very L2-cache-bound problem (yep i measured, yep in release ;)). for small sizes of vec (below around 100) everything runs nicely. however, when i set the size of vec to something bigger it gets worse and worse. i suspect that this is due to the fact that each instance of thestruct::d is not in linear order with the rest of thestruct.. sort makes this even worse as it destroys the linearity in memory completely. so even just iterating through all thestructs and accessing each thestruct::d produces lots of L2 cache misses (actually about 70 - 80 % with my configuration). also, sort makes any kind of clever software prefetching impossible, i think. according to the profile, L2 cache misses are the biggest problem with my algorithm by far. prior to designing everything with std::vector i had an implementation that did all the memory management itself.. and always ensured a linear memory layout of vec:
| a,b,c (d1, d2, d3, .... dn) |  a,b,c, (d1, d2, d3, ... dn) .....
however this of course required a lot of hand-coded re-ordering of memory (especially after something like std::sort...) so i dropped that solution, as it affected the clarity of my algorithm's implementation a lot. so, are std::vector<thestruct> and L2 cache friendliness like fire and water? thanks a lot for any advice! i'm a bit near my skill limits with this one.. best regards, simon.

I think you've misunderstood how vectors work. A vector's elements are guaranteed to be contiguous in memory. Randomly accessing an element is a constant-time operation, no different from accessing a standard array. Iterating through a vector is no different from iterating through a standard array.

Vectors become expensive if you're resizing the vector or deleting or reordering elements. Internally they'll be using something like array doubling to reallocate, which will obviously be a lot slower as the vector size grows. If the overhead for these operations is too high then you might want to consider another data structure, like a linked list.

Quote:
Original post by thesimon

struct thestruct
{
float a,b,c;
std::vector<float> d;
};


Don't do that. Use d[0],d[1] and d[2] as a,b,c and scrap the struct. d[3] is your d0 then. You can easily randomize or sort parts of vectors.

The problem is actually probably std::vector's expensive copy semantics, which show up as L2 cache problems, rather than inherent data access problems caused by the indirection in std::vector's data storage mechanism. What happens when std::sort() does its work is that the assignment operator is called repeatedly for the data type, which for your type involves copying the vector data around. This isn't very efficient. You may want to consider using a different data type with less expensive copy semantics, for example boost::shared_array<float> or boost::shared_ptr<std::vector<float> >.

thanks for your answers!

benjamin bunny, i know that std::vector and standard arrays behave in a similar way. i wouldn't have these l2 problems if i didn't have a vector of vectors, as swiftcoder pointed out. reading each thestruct::d in a sorted vector<thestruct> results in a very non-linear access pattern.

(thinking loudly here...)
so i have to restructure my data: maybe make one struct containing a,b,c and another struct containing d1,..,dn and then have two vectors, one of the first struct and one of the second. if i sort the first vector i lose the correspondence between the two vectors. so i need something like a pointer in the first struct (a,b,c,pointer) into the second struct.

struct s2 { std::vector<float> d; };
struct s1 { float a,b,c; s2 *p; };
std::vector<s1> vec1(N);
std::vector<s2> vec2(N);


iterating over the first struct i always have a pointer indirection into the second struct when i want to access the d elements. this indirection is again non-linear and cache-unfriendly... another possibility would be to reorder the second vector based on the new sorted order of the first one...... this sucks.
having two vectors containing data of what is actually a single struct is also very unintuitive...

i could also precompute the dot product for each element of the second vector and then, while iterating over the first sorted vector, look up the result of the dot product in a table based on pointer p (the table would be small enough to be more likely to fit into L2....). this way i would only access vec1 and vec2 in a linear, predictable way... the only thing non-linear would be the lookup of the precomputed dot product.. which should be L2 friendly.

is this the way to go? i know my description here is quite confusing.. i tried my best.....
i haven't thought through yet how this will affect my implementation... but breaking the struct down into two smaller structs (thus avoiding std::vector<std::vector<> >) seems desirable....

does this sound reasonable? (i'll go ahead and try implementing a benchmark... trying to judge by the numbers).

thanks again!
simon.

Quote:
Original post by Trap
Doesn't sort() use swap<>() which is efficient on vectors?


Not specified by the standard. And even if it was, you'd need to specialize swap() for thestruct, since even if std::sort() did use swap(), swap() on thestruct would use assignment anyway.

I would just use:
vector<vector<float> > vec;

And store a, b and c at the start of every vec[i]; that way you have one less indirection.

AAh. thanks for the other two replies! they came in while i was typing the above post.

SiCrane, i remember from my previous attempt at this (doing memory layout myself) that it was by far faster than the vector version.. this could very well be because my custom layout, when passed to sort(), would result in a lot fewer copies. the second (d) vector wouldn't be copied around then.. so my previous attempt was essentially the same as using boost::shared_ptr<std::vector<> > as you suggested... only the a,b,c members and the pointer got copied around by sort(); the d vector was never touched (reducing the amount of data moved around to 10 - 20 %).

i will try your boost::shared_ptr<> suggestion first before i go ahead and split the struct into two, as the impact on my code of your suggestion is non-existent :) which i like :)

Trap, your solution would suffer from the same 'poor copy semantics' when feeding it into sort.. if SiCrane is correct, this is (a big) part of the problem...

i'll give it a try!

regards,
simon

The standard library implementation might use swap() to sort a vector<vector<float> >. If it does, that is the fastest way to code it. If it does not, using shared_ptr seems a good idea to me.

Trap, you might be correct about sort() using swap() under certain circumstances (haven't checked it myself yet, but i will). however, i only used sort() as an example of something that heavily disturbs the vector (and especially the distance in memory between each thestruct and its thestruct::d member).. my algorithm actually does something different.
i'm trying to get some measurements with the boost::shared_ptr<> right now. will report :)

Read my postings in context - I don't want to use structs at all. I would scrap the struct and put a, b and c in front of the d-vector. That way a,b,c,d0,d1,... are always next to each other (in one vector<float>).

If you don't use std::sort, that's even better, because then you don't have to worry about whether it uses swap or not. In the functions you write yourself you can always use swap.

there's a few percent of speed improvement when going with SiCrane's idea of having a boost::shared_ptr<std::vector<> > inside 'thestruct'.
definitely something i'm very thankful for.. 'under the hood' more stuff gets copied around than you'd think (in some places in the app i do actually use sort and partial_sort, so the speed gain comes from there).

measurements have also shown that the worst problem i have is still a 64k cache line split / L2 cache miss one.

to spill a bit more concrete code (off the top of my head, no guarantees):


struct thestruct // almost the same as in the O.P.
{
    float a, b, c;
    boost::shared_ptr<std::vector<float> > d;
};

// min_element() wants a binary comparator, so this functor compares two
// elements by their distance to temp_ (each distance is recomputed per
// comparison -- good enough for a sketch)
template<class VecType, class TheStruct>
struct compute_distance
{
    compute_distance(const VecType &temp) : temp_(temp) {}

    float dist(const TheStruct &ts) const
    {
        float acc = 0.0f;
        for (size_t i = 0; i < temp_.size(); ++i) // each thestruct::d has temp_'s size
        {
            float d = (*ts.d)[i] - temp_[i];
            acc += d * d;
        }
        return std::sqrt(acc);
    }

    bool operator () (const TheStruct &lhs, const TheStruct &rhs) const
    {
        return dist(lhs) < dist(rhs);
    }

    VecType temp_;
};

// ...................

typedef std::vector<float> d_vec_type;
d_vec_type temp = *vec[0].d;
std::vector<thestruct>::iterator it = std::min_element(vec.begin(), vec.end()
    , compute_distance<d_vec_type, thestruct>(temp));


the call to min_element() has by far the highest clock count and by far the highest L2 cache miss count (remember that each access of thestruct::d is a journey through memory land with a very scattered destination).

Trap, i'm somewhat reluctant to give up the naming of thestruct's members.. they have a very distinct meaning.. i'd have to define constants and the code would become quite a bit less readable. as this code is supposed to become a library i don't want to introduce so much confusion. thanks for the advice, though. i suppose it would eliminate my cache line splits / l2 problem. it's just too high a price for my taste... things would be different if a,b,c were vertex data or something... but as i said, their distinct meaning is quite important.

seems like splitting thestruct into two structs and going with my earlier idea of precomputing the distances in an out-of-order way and then looking up the results in a table is the way to go?

regards,
simon.

Before you try that, you can probably get better cache performance by using boost::pool to allocate the std::vectors that you pass to the shared pointers. Just remember to specify the destruction function to the shared pointer constructor properly. For that matter, if your vectors usually hold roughly the same number of elements, then you might be able to use a pool allocator instead of the standard allocator to get better cache locality as well.

Right, SiCrane. introducing allocators was on my list. profiling runs have shown that construction cost and the general cost associated with the vectors contribute quite a bit. i have to read up on boost's pooled allocators a bit, though.

the idea of splitting thestruct doesn't appeal to me so much. it may be very beneficial to performance, but very disruptive to the code. maybe so much that i won't even try and implement it.

it just bothers me that such a simple operation is the most costly part of an algorithm due to its memory spread...

thanks again :) i'll make the vectors use pooled allocators and see how it affects performance...

regards,
simon.

I'm guessing that you'll actually get more benefit out of allocating the vectors from pools than from making the vectors use pooled allocators. (Though the latter is probably slightly easier.)

Quote:
Original post by thesimon
Trap, i'm somewhat reluctant to give up the naming of thestruct's members.. they have a very distinct meaning.. i'd have to define constants and the code would become quite a bit less readable. as this code is supposed to become a library i don't want to introduce so much confusion. thanks for the advice, though. i suppose it would eliminate my cache line splits / l2 problem. it's just too high a price for my taste... things would be different if a,b,c were vertex data or something... but as i said, their distinct meaning is quite important.
In that case, perhaps you'd consider changing the std::vector<float> to a plain old array of floats, which will then of course reside just after the other 3 floats in memory. All assuming that each struct has a vector of the same size, of course. You can still use std algorithms on it.
I'm sure it would also give you a reasonable speed boost. Don't forget that although the per-element overhead of a vector may be zero, there is definitely some per-vector overhead for storing the pointer to the array, the size, and perhaps the capacity etc.

Oh, except that your structure copies could become more expensive, so you'd perhaps want to store only pointers to them in the main vector. Even not doing so might still be a winner though. It depends on N.

Hi iMalc!

if the N in the above code snippet suggested that N is a compile-time constant, i apologize for that. it was just used to keep things simple.

the size of thestruct::d is the same for each element of std::vector<thestruct>; however, it is not known at compile time (i'd have gone a completely different way if it was). so i can't use a 'normal' array or boost::array<>. speed would indeed be excellent if i could... the linear memory layout it gives could otherwise only be achieved with a lot of plumbing for the case of a run-time N (which i did quite a while ago... but which i dropped because of ugliness :)

cheers,
simon.

I actually had this kind of thing in one of my programs, in a time-critical part (there is never enough speed for software rendering).

I did it as
struct blabla
{
    float blabla2;
    int size;
    somestruct data[1]; // actual number of allocated elements is size. 1 instead of 0 for compliance with compilers that don't allow zero-length arrays.
};
then i allocated it with malloc, so the size of the allocated block is
sizeof(blabla) + sizeof(somestruct)*(size-1)
where size is the number of items to store in data.

As for using it with sort, do you know that you can sort c arrays with something like
int a[100];
...
std::sort(a, a+100);
?
There is not much of a problem working with arrays. Pointers work the same as iterators (in fact std::vector's iterators _are_ pointers in at least some implementations), so all or almost all STL algorithms can be used on c arrays.

Quote:
Original post by benjamin bunny
I think you've misunderstood how vectors work. A vector's elements are guaranteed to be contiguous in memory. Randomly accessing an element is a constant-time operation, no different from accessing a standard array. Iterating through a vector is no different from iterating through a standard array.

Vectors become expensive if you're resizing the vector or deleting or reordering elements. Internally they'll be using something like array doubling to reallocate, which will obviously be a lot slower as the vector size grows. If the overhead for these operations is too high then you might want to consider another data structure, like a linked list.

Vector stores its actual data on the heap, so if you have this kind of struct, a,b,c will be where the struct is placed, and the vector's buffer will be on the heap; even if the struct is itself on the heap, chances are the vector's buffer will be somewhere else.
If you read many of such things, it is like random memory accesses, i.e. very poor caching.

Quote:
Original post by Dmytry
Vector stores its actual data on the heap, so if you have this kind of struct, a,b,c will be where the struct is placed, and the vector's buffer will be on the heap; even if the struct is itself on the heap, chances are the vector's buffer will be somewhere else.
If you read many of such things, it is like random memory accesses, i.e. very poor caching.


benjamin didn't mean that. he meant that (as of C++ 2003 TC1) std::vector must hold its elements in memory contiguously, and that std::vector (as with the rest of the standard library containers) stores its data wherever its allocator type stores it; for instance, if you use GCC's custom allocator type array_allocator then it's not on the heap, it's typically on the stack.

Quote:
Original post by snk_kid
Quote:
Original post by Dmytry
Vector stores its actual data on the heap, so if you have this kind of struct, a,b,c will be where the struct is placed, and the vector's buffer will be on the heap; even if the struct is itself on the heap, chances are the vector's buffer will be somewhere else.
If you read many of such things, it is like random memory accesses, i.e. very poor caching.


benjamin didn't mean that. he meant that (as of C++ 2003 TC1) std::vector must hold its elements in memory contiguously, and that std::vector (as with the rest of the standard library containers) stores its data wherever its allocator type stores it; for instance, if you use GCC's custom allocator type array_allocator then it's not on the heap, it's typically on the stack.

i know the standard thing. the point is that the "data" contained in the vector is not contiguous with "a,b,c", nor with the vector's own bookkeeping (pointer and size), and it's that kind of non-contiguity the OP is concerned about

thanks for your replies, Dmytry and snk_kid!

Dmytry is correct about my concerns stated in the original post. However, Dmytry, your solution is quite similar to my previous attempt at managing the memory layout myself. It is true that using plain old malloc i can achieve a linear memory layout... so when iterating over the whole range i have nice data locality.

in my previous attempt i had something like

struct blabla
{
float blabla2;
float *s;
};
then reserve memory the size of ( num_blabla * (sizeof(blabla) + sizeof(float)*n) )

i used to have a custom iterator that hid the jumping from one instance of blabla to the next in my malloc'ed array....

but your (and my) solution seems to suffer from being sorted. i know c++ arrays can be sorted using the standard algorithms... it seems to me that during sort the actual memory locations of the somestruct data[1] members are never changed... or is the blabla struct always copied entirely (with all data)? (i know that in the above code snippet the array pointed to by float *s is never copied by sort(), as sort only sees the pointer value itself, not the data pointed to by it) does your code behave differently?

thanks for your replies,
simon

Quote:
Original post by thesimon
in my previous attempt i had something like

struct blabla
{
    float blabla2;
    float *s;
};

but your (and mine) solution seems to suffer from being sorted. [...] does your code behave differently?


My code is different - i have an array, you have a pointer.
With my solution, say i have

struct my_struct
{
    int a,b,c;
    float x,y,z;
    int data_size; // note: it is not necessary to store the size here; it's just to show the concept
    float data[1];
};
...
...
Let's look at how it lays out in memory:

offset:  0  4  8  12 16 20 24         28      !  32      36      40
member:  a  b  c  x  y  z  data_size  data[0] !  data[1] data[2] data[3] ...

The "!" marks the end of the "structure". But you can access memory past the end of the structure through data.
Example: if you just use

my_struct something;
int foobar;
something.data[0]=1; // okay
something.data[1]=1; // overwrites foobar

Of course a buffer overflow is not exactly what we want, so we need to reserve enough memory right after the struct, so that using data[1] doesn't overwrite something we don't want to overwrite.
So, we need to:

my_struct *pointer_to_my_struct = (my_struct*)malloc(sizeof(my_struct) + sizeof(float)*(my_array_length-1));
pointer_to_my_struct->data_size = my_array_length; // just for example, i'll use it in sorting

Now "pointer_to_my_struct" points to the beginning of a region of memory that stores a my_struct whose array "on the end" is my_array_length items long. The items are addressed with
pointer_to_my_struct->data[0], pointer_to_my_struct->data[1], and so on.

Sorting:
the data in a my_struct can be sorted with
std::sort(pointer_to_my_struct->data, pointer_to_my_struct->data + pointer_to_my_struct->data_size);


For storing an array of my_struct, you can use a vector, like
std::vector<my_struct*> vec;
(could be somewhat better than the original, cache-wise).
You can allocate space for all the my_structs at once so you will not have gaps. That's actually somewhat like how i did it: i allocate one large piece of memory where i put many instances of my_struct (differently sized), and i also have an array with pointers to the individual my_structs for easier iteration through them.

[For me vector was not an option because the whole thing has many arrays with about 3..10 items each, and there can be hundreds of millions of such arrays, with the whole thing sized in gigabytes. Cache was not a big issue because it can't fit in cache no matter what.]

[Edited by - Dmytry on December 17, 2005 6:43:04 PM]

yes, i get it. thanks for the explanation, Dmytry.

but what about sorting the whole array (that's what is interesting to me.. the my_struct::data does not need to be sorted)?
.. doesn't this corrupt your memory layout (and even make it unusable thereafter)? i mean... when sorting an array of my_structs, doesn't sort ignore the my_struct::data part of it completely?

my_struct *thestructs = malloc(......); // just as you did
thestructs[i].data[0]; // ok: valid access, for some i

sort(thestructs, thestructs + number_of_mystructs); // assume my_struct has an operator< defined

thestructs[i].data[0]; // not ok, imo. it won't result in an overflow, but
// the data does not necessarily belong to thestructs[i] any more. it is the data
// that was in that place before sorting!

am i right there!?

(i'm so confused right now... if i'm not right, i'll sleep on this and come back tomorrow... sorry, dudes)

simon.

