std::vector, L2 cache
std::sort() might use swap() when sorting a vector<vector<float> >. If it does, that's the fastest way to code it. If it doesn't, using shared_ptr seems a good idea to me.
Trap, you might be correct about sort() using swap() under certain circumstances (haven't checked it myself yet, but i will), however i only used sort() as an example of something that heavily disturbs the vector (and especially the distance in memory between each thestruct and its thestruct::d member)... my algorithm actually does something different.
i'm trying to get some measurements with the boost::shared_ptr<> right now. will report :)
Read my postings in context: I don't want to use structs at all. I would scrap the struct and put a, b and c in front of the d-vector. That way a, b, c, d0, d1, ... are always next to each other (in one vector<float>).
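Something like this, just to sketch the layout (all names and numbers here are only example values, not from the actual code):

#include <cstddef>
#include <vector>

int main()
{
    std::size_t N = 16;            // run-time size of the d part (example value)
    std::size_t num_records = 100; // example value
    std::size_t stride = 3 + N;    // each record is [a, b, c, d0 .. d(N-1)]
    std::vector<float> data(num_records * stride);

    std::size_t i = 42;                // some record index
    float &a = data[i * stride + 0];
    float &b = data[i * stride + 1];
    float &c = data[i * stride + 2];
    float *d = &data[i * stride + 3];  // d[0] .. d[N-1] follow a, b, c contiguously

    a = b = c = 0.0f;
    d[0] = 1.0f;
    return 0;
}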
If you don't use std::sort that's even better because you don't have to worry whether it uses swap or not. In the functions you write you can always use swap.
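Swapping two vectors is cheap regardless of their size, e.g.:

#include <vector>

int main()
{
    std::vector<float> x(1000, 0.0f), y(2000, 1.0f);
    x.swap(y); // constant time: only the vectors' internal pointers are exchanged, no element copies
    return 0;
}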
there's a few percent of speed improvement from going with SiCrane's idea of having a boost::shared_ptr<std::vector<> > inside 'thestruct'.
definitely something i'm very thankful for... 'under the hood' more stuff gets copied around than you'd think (in some places in the app i do actually use sort and partial_sort, so the speed gain comes from there).
measurements have also shown that the worst problem i have is still a 64k cache line split / L2 cache miss one.
to spill a bit more concrete code (off the top of my head, no guarantees):
#include <algorithm> // std::min_element
#include <cmath>     // std::sqrt
#include <cstddef>
#include <vector>
#include <boost/shared_ptr.hpp>

struct thestruct // almost the same as in the O.P.
{
    float a, b, c;
    boost::shared_ptr<std::vector<float> > d;
};

template<class VecType, class TheStruct>
struct compute_distance
{
    explicit compute_distance(const VecType &temp) : temp_(temp) {}

    // distance of one element's d-vector to the reference vector temp_
    float operator () (const TheStruct &ts) const
    {
        float dist = 0.0f;
        for(std::size_t i = 0; i < temp_.size(); ++i) // temp_.size() == N, the run-time
                                                      // size of each thestruct::d
        {
            float d = (*ts.d)[i] - temp_[i];
            dist += d * d;
        }
        return std::sqrt(dist);
    }

    // comparator form, so the functor can be handed to min_element()
    bool operator () (const TheStruct &lhs, const TheStruct &rhs) const
    {
        return (*this)(lhs) < (*this)(rhs);
    }

    VecType temp_;
};

// ...................

typedef std::vector<float> d_vec_type;
d_vec_type temp = *vec[0].d;
std::vector<thestruct>::iterator it =
    std::min_element(vec.begin(), vec.end(),
                     compute_distance<d_vec_type, thestruct>(temp));
the call to min_element() has by far the highest clock count and by far the highest L2 cache miss count (remember that each access of thestruct::d is a journey through memory land with a very scattered destination).
Trap, i'm somewhat reluctant to give up the naming of thestruct's members.. they have a very distinct meaning.. i'd have to define constants and the code would become quite a bit less readable. as this code is supposed to become a library i don't want to introduce so much confusion. thanks for the advice, though. i suppose it would eliminate my cache line split / L2 problem. it's just too high a price for my taste... things would be different if a, b, c were vertex data or something... but as i said, their different meaning is quite important.
seems like splitting thestruct into two structs and going with my idea (the one before my previous post) of precomputing the distances out of order and then looking up the results in a table is the way to go?
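roughly along these lines, just a sketch of the table part (it reuses the compute_distance functor from the snippet above; the split into two structs isn't shown):

// precompute all distances to 'temp' in one pass, then pick the minimum by index
compute_distance<d_vec_type, thestruct> dist_to_temp(temp);
std::vector<float> dist(vec.size());
for (std::size_t i = 0; i < vec.size(); ++i)
    dist[i] = dist_to_temp(vec[i]);
std::size_t best = std::min_element(dist.begin(), dist.end()) - dist.begin();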
regards,
simon.
Before you try that you can probably get better cache performance by using boost::pool to allocate the std::vectors that you pass to the shared pointers. Just remember to specify the destruction function to the shared pointer constructor properly. For that matter, if your vectors usually hold roughly the same number of elements then you might be able to use a pool allocator instead of the standard allocator to get better cache coherency as well.
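Something along these lines is what I mean (untested sketch; pool_deleter and make_pooled_vector are just names for the example, and note that this pools the vector objects themselves, not their element storage):

#include <cstddef>
#include <vector>
#include <boost/pool/object_pool.hpp>
#include <boost/shared_ptr.hpp>

typedef std::vector<float> d_vec_type;

// deleter that hands the vector back to the pool instead of calling delete
struct pool_deleter
{
    explicit pool_deleter(boost::object_pool<d_vec_type> *pool) : pool_(pool) {}
    void operator()(d_vec_type *p) const { pool_->destroy(p); }
    boost::object_pool<d_vec_type> *pool_;
};

boost::shared_ptr<d_vec_type> make_pooled_vector(boost::object_pool<d_vec_type> &pool,
                                                 std::size_t n)
{
    d_vec_type *raw = pool.construct(); // the vector object itself comes from the pool
    raw->resize(n);                     // its element block still comes from the default allocator
    return boost::shared_ptr<d_vec_type>(raw, pool_deleter(&pool));
}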
Right, SiCrane. introducing allocators was on my list. profiling runs have shown that construction cost and the general cost associated with the vectors contribute quite a bit. i have to read up on boost's pooled allocators a bit, though.
the idea of splitting thestruct doesn't appeal to me so much. it may be very beneficial for performance, but very disruptive to the code. maybe so much so that i won't even try to implement it.
it just bothers me that such a simple operation is the most costly part of an algorithm due to its memory spread...
thanks again :) i'll make the vectors use pooled allocators and see how it affects performance...
regards,
simon.
I'm guessing that actually you'll get more benefit out of allocating the vectors from pools than making the vectors use pooled allocators. (Though the latter is probably slightly easier.)
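The latter would look something like this (sketch, using boost's pool_allocator; the element storage then comes from boost's singleton pool):

#include <vector>
#include <boost/pool/pool_alloc.hpp>

// "pooled allocator" variant: the vector's float storage is drawn from the pool
typedef std::vector<float, boost::pool_allocator<float> > pooled_d_vec;

int main()
{
    pooled_d_vec d(128, 0.0f); // the 128 floats are allocated through the pool
    d[0] = 1.0f;
    return 0;
}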
Quote: Original post by thesimon
Trap, i'm somewhat reluctant to give up the naming of thestruct's members.. they have a very distinct meaning.. [...] things would be different if a, b, c were vertex data or something... but as i said, their different meaning is quite important.

In that case, perhaps you'd consider changing the std::vector<float> to a plain old array of floats, which will then of course reside just after the other 3 floats in memory. All assuming that each struct has a vector of the same size, of course. You can still use std algorithms on it.
I'm sure it would also give you a reasonable speed boost. Don't forget that although the per-element overhead in a vector may be zero, there is definitely some per-vector overhead for storing the pointer to the array, the size, and perhaps capacity etc.
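Just to put a rough number on that per-vector overhead (typical figures; the exact size is implementation-dependent):

#include <iostream>
#include <vector>

int main()
{
    // the bookkeeping (roughly: pointers to begin, end and end-of-capacity),
    // which lives apart from the separately allocated element block
    std::cout << sizeof(std::vector<float>) << '\n'; // typically 12 bytes on 32-bit, 24 on 64-bit
    return 0;
}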
Oh except that your structure copies could become more expensive, so you'd perhaps want to only store pointers to them in the main vector. Even not doing so might still be a winner though. It depends on N.
Hi iMalc!
if the N in the above code snippet suggested that N is a compile-time constant, i apologize for that. it was just used to keep things simple.
the size of thestruct::d is the same for each element of std::vector<thestruct>, however it is not known at compile time (i'd have gone a completely different way if it was). so i can't use a 'normal' array or boost::array<>. speed would indeed be excellent if i could use that... it would result in a linear memory layout that can only be achieved with a lot of plumbing for the case of a run-time N (which i did quite a while ago... but which i dropped because of its ugliness :)
cheers,
simon.
I actually had this kind of thing in one of my programs, in a time-critical part (there is never enough speed for software rendering).
I did it as
struct blabla {
    float blabla2;
    int size;
    somestruct data[1]; // actual number of allocated elements is 'size'; the 1 is for
                        // compliance with compilers that don't allow you to put zero there.
};
then i allocated it with malloc, so the size of the allocated block is
sizeof(blabla) + sizeof(somestruct) * (size - 1)
where size is the number of items to store in data.
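i.e. roughly (just a sketch; make_blabla is only a name for the example, you need <cstdlib> for malloc, and error checking is omitted):

blabla *make_blabla(int size)
{
    // one malloc holds the header plus all 'size' elements, so everything is contiguous
    blabla *p = (blabla *)std::malloc(sizeof(blabla) + sizeof(somestruct) * (size - 1));
    p->size = size;
    return p;
}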
As for using it with sort, do you know that you can sort C arrays with something like
int a[100];
...
std::sort(a,a+100);
?
There is not much problem working with arrays. Pointers work the same as iterators, actually (in fact std::vector's iterators _are_ pointers in at least some implementations), so all or almost all STL algorithms can be used on C arrays.
Quote: Original post by benjamin bunny
I think you've misunderstood how vectors work. A vector's elements are guaranteed to be contiguous in memory. Randomly accessing an element is a constant-time operation, no different from accessing a standard array. Iterating through a vector is no different from iterating through a standard array.
Vectors become expensive if you're resizing the vector or deleting or reordering elements. Internally they'll be using something like array doubling to reallocate, which will obviously be a lot slower as the vector size grows. If the overhead for these operations is too high then you might want to consider another data structure, like a linked list.
A vector stores its actual data on the heap, so with such a struct a, b and c will be wherever the struct is placed, while the vector's body will be on the heap; even if the struct itself is on the heap, chances are the vector's body will be somewhere else.
If you read many such things, it is like random memory accesses, i.e. very poor caching.