mikeman, you could be running out of stack: there is constant-size storage (8 objects) for small vectors, and 8 * 64K = 512K.
Do I need to learn the Standard C++ Library???
> You create an entire new list class just to ensure that your node fits in an eight byte block of memory when it stores an int, but don't bother profiling it when it
There are more design choices that went into this than are apparent at a quick glance.
Here are further things for consideration:
1. Each dynamic heap allocation has static overhead (defeatable with a custom allocator, though). I recall it was 16 bytes per allocation, or in that ballpark; I did measurements and profiling when designing this code, it wasn't written in an information vacuum. That brings us up to 24 bytes, minimum, per node.
With our arrangement, the overhead is (16+8)/16 bytes per allocation (1.5 bytes!). This means our memory overhead is 16 times smaller(!) for the list infrastructure. It was an *additional* benefit that, on top of that, we can fit one "int" worth of payload there FOR FREE.
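For concreteness, here is a minimal sketch of the kind of 8-byte node being described (the names and layout are illustrative, not the library's actual declaration):

#include <cstdint>

// Illustrative only: one pointer-sized XOR field plus the payload.
struct packed_node
{
    std::uintptr_t mask;  // prev XOR next: 4 bytes on a 32-bit target
    int            data;  // the "free" int payload: 4 bytes
};
// sizeof(packed_node) == 8 on a typical 32-bit platform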
2. The allocations are contiguous, with vector-like cache behaviour; you can observe in profiling that core::list can be put head-to-head with std::vector (!) and hold its ground quite well in most situations.
It was not done to "ensure" that my node fits in an eight-byte block.
> stores anything but an int [which should be considered the standard situation].
Feel free to profile with different types; similar behaviour can be observed. If you think that I write code without knowing what I am doing, you are mistaken.
> You provide for custom allocators in your type definition, but don't bother actually using them.
It has nothing to do with bothering; the code will be written when I get down to it. There is still work left on this library, no shit.
> You jump through elaborate hoops in order to get around your disregard for type safety.
int* a = ...;
int* b = 0;
ptrdiff_t x = a - b;
That is a legal C++ construct for all practical purposes (pedantically, subtracting unrelated pointers is undefined behaviour, but on a flat memory model it does the obvious thing). I should do a code review and remove some reinterpret_casts which are not really needed there, true.
> You don't document anything, least of all the aforementioned hoops.
A good point. I would write documentation if I were more interested in the documentation than in actually using the code.
I wouldn't have made the code available at all if I hadn't gotten so many requests all the time to "send the latest version" or "can I see it" from old customers, colleagues and users of the previous commercial package. So I made it available for download without registration or overly complicated licensing schemes.
> *edit: I'm going to go out on a limb and suggest that your xor optimization provided marginal benefit overall, and that the added complexity for every single list operation completely outweighs the benefit it did provide. I'm
Example case: iterator operator++
temp = current;
current = current->step(source);
source = temp;
Where step is essentially "(a - 0) ^ b"; the -0 is a constant literal, so it is optimized away (you can check this with your non-stone-age compiler). What's left is a ^ b, which is essentially no slower than the plain assignment:
current = current->next;
The "cost" lies in storing a temporary. <:() Interesting observation, since we are doing xor, the output port (-register) in machine language must be a register in most architechtures. This is stored in current, so it makes sense that current in register during all of the graph after register allocation, this block is so small that it would be insane code generator to store register in memory followed by load from same memory. :)
The "temp" can be memory if it is spilled, that is case by case basis, but even so, it should reside in top-of-stack, which in turn should be in cache most of the time, L1 on top of that.
> instead going to posit that your profile results [assuming they are valid, I can't bring myself to look at that file in any detail after reading through your list header] are entirely the byproduct of your pooled allocator. Remove the pooled allocator, and rerun your tests. I'd put money on the standard library outperforming you.
Your claim is easy to satisfy; here are the changes to the code required:
node* node_allocate() { return new node(); }
void node_release(node* p) { delete p; }
How much were you betting, again? I'd like to have it, please! Here is the result:
std::list 70.1866
core::list 60.7829
FYI, the node allocation abstraction being a discrete method isn't an accident or pure chance.
Another note: a custom allocator still has more overhead than this arrangement; an allocation is typically:
if ( a op b )
    return memory_dereference;
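For illustration, a generic free-list pool with exactly that fast path might look like this (a sketch; pool, free_list and refill are made-up names, not the library's identifiers):

// Generic free-list pool sketch, not the library's allocator.
struct node { node* next_free; /* payload lives here when allocated */ };

struct pool
{
    node* free_list;

    pool() : free_list(0) {}

    node* allocate()
    {
        if (free_list)                  // the "a op b" compare
        {
            node* n   = free_list;      // the "memory_dereference"
            free_list = n->next_free;
            return n;
        }
        return refill();                // slow path: get fresh memory
    }

    void release(node* n)
    {
        n->next_free = free_list;       // push the node back on the list
        free_list    = n;
    }

private:
    node* refill()
    {
        // sketch only: a real pool would carve out a whole block of
        // nodes at once to amortize the heap call
        return new node();
    }
};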
Yes, I also tested custom allocator overhead, with both the std -and- core list implementations. The custom allocator is not enabled at this time because it was implemented before the placement-new and raw-memory delete went in.
I haven't yet found time to integrate that branch back in. So unprofessional, I know.
I'll fix this, thanks for the feedback & motivation boost. Last year I worked on this for maybe a week or two, max, as I went to the Philippines for 8 months and worked from there for my parent company (which will change *again* next autumn due to re-structuring, this time by AMD ;)
FYI, I used to maintain the previous version of the library as a full-time job. Not anymore; now it's a hobby project, as I am doing OpenGL drivers daytime, and night time, well, censored. >B)
And larger types..
template <typename testdata>
void profile(core::timer* xtimer)
{
    profile< std::list<testdata> >( xtimer, "std::list " );
    profile< core::list<testdata> >( xtimer, "core::list " );
    profile< std::vector<testdata> >( xtimer, "std::vector " );
    profile< core::vector<testdata> >( xtimer, "core::vector" );
}
struct intfoo
{
    int v[32];
    intfoo(int x) { v[0] = x; }
    operator int () const { return v[0]; }
};
// invoking..
profile< intfoo >(xtimer);
--
core::list and core::vector - still faster. In fact, with larger objects stored, the iterator for core::vector gains ground on std::vector's iterator; this is a mystery, but it is the observed behaviour. :/
One thing to note is that I usually tend to create "huge" objects on the heap using new or a ::create() method following the factory design pattern; pointers to these are stored instead of values. A VERY common situation for code I write (reminder: *large* objects).
I mostly use by-value storage for POD and trivial/small types, but again that is a personal choice forged by years of practice.
One example that springs to mind where I store by value and the object can be fairly large is a fixed vertex layout when exporting from modeling software. ;)
I want to keep all the relevant data in one place, because I want to store objects of that type (vertex) by value in a std::set so that I can build a set of unique vertices (and resolve the vertex indices when adding vertices into the set). A VERY fast way to generate nice indexed data from virtually _any_ modeling software for export to a graphics engine, where indexed primitives are welcome guests.
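A sketch of that pattern (the names are illustrative; I use a std::map<vertex, int> here rather than a plain std::set, since resolving indices needs a value stored next to the key):

#include <cstring>
#include <map>
#include <utility>
#include <vector>

struct vertex
{
    float px, py, pz;   // position
    float nx, ny, nz;   // normal
    float u, v;         // texture coordinates
};

// Strict weak ordering over the raw bytes, so vertex can live in a map.
// (Byte comparison treats 0.0 and -0.0 as distinct, which is fine for
// deduplication purposes.)
struct vertex_less
{
    bool operator()(const vertex& a, const vertex& b) const
    {
        return std::memcmp(&a, &b, sizeof(vertex)) < 0;
    }
};

// Returns the index of v, appending it to the unique pool if unseen.
int resolve_index(const vertex& v,
                  std::map<vertex, int, vertex_less>& seen,
                  std::vector<vertex>& unique_vertices)
{
    std::map<vertex, int, vertex_less>::iterator it = seen.find(v);
    if (it != seen.end())
        return it->second;              // already known: reuse its index
    int index = static_cast<int>(unique_vertices.size());
    unique_vertices.push_back(v);       // new vertex: append and record
    seen.insert(std::make_pair(v, index));
    return index;
}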
Now, to the business at hand. First of all, I used to be unable to use the std containers because console platforms used to have very minimal C++ support, and std just didn't work. Period. We needed containers, so we rolled our own. Now that compilers are finally catching up on C++ features (still missing export on most compilers!), all these fancy constructs are possible.
In fact, even though we can observe some world-class std implementations being repeatedly trounced here, I am considering dropping core::list and core::vector, for the reason that I have enough RAM these days (2 GB on this machine; my peak use is around 2.5 GB), PLUS the std implementations on Visual C++ and g++ are finally reaching standards that are acceptable. ;-)
If it didn't make any difference, I would have done that years ago already. But note: short of writing my own containers, I'd use the std ones. I'm moving to those. At first they simply weren't available on the platforms I worked on. Then they were crap. Now they are decent.
So: I in fact urge that NO ONE here use the code that is available. I don't guarantee it will be available for download in the future, or that I will keep releasing revised versions in public. I write a lot of code that I discard after a while, because I want to try out ideas that come into my head. I keep the good ideas and discard the bad ones without much regret.
But the fact is that there was a claim that one shouldn't expect *better* performance when writing one's own without using SSE(3) et cetera. It is already established that that isn't true, agreed?
The thing is, while your project is admirable, whenever we see someone claiming their implementation is faster, we always suspect that the implementation neglects several features and standards in order to "pass" some selected benchmarks. But the SC++L is so much more than that. We must compare the two implementations in their entirety. Case in point: sorting a core::list takes more than 2x the time it takes to sort a std::list (in my test, sorting a std::list with 10000 integers takes 70ms on average, whereas doing the same with a core::list takes 180ms). With larger types it's even worse; std seems to be up to 4x faster.
> But SC++L is so much more than that. We must compare the 2 implementations in their entirety.
Of course. I'm not in the business of writing the Standard Library, but when I do something, you can bet your money on my doing as good a job as possible. Even if I go ahead and ditch this code, I _will_ complete it as an exercise before that. :)
>Case in point: sorting a core::list takes more than 2x the time it takes to sort an std::list(in my test, sorting a std::list with 10000 integers takes on average 70ms, whereas doing the same with a core::list takes 180ms). With larger types it's even worse, std seems to be up to 4x faster.
Thank you for testing. I never profiled that part; it is a merge sort, and from the code comment you can see that it wasn't written by me. It is there for completeness' sake only, because you cannot use the sort from <algorithm> with a list, for obvious reasons.
A small note: the merge sort is O(n log n), and that is also reasonable to assume for most (I am tempted to say all ;) std::list::sort implementations. The reason is the lack of random-access iterators for std::list; if anyone has better ideas, I am all ears. Anyway, the point is that the devil must be in the implementation *detail* rather than in the choice of a wrong algorithm. I never really looked at this code with an eye on efficiency; I just wanted it to work correctly.
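To illustrate the point about iterators (a tiny usage sketch, nothing library-specific):

#include <algorithm>
#include <list>
#include <vector>

void sort_demo()
{
    std::list<int> l;
    l.push_back(3); l.push_back(1); l.push_back(2);

    // std::sort needs random-access iterators; std::list only offers
    // bidirectional ones, so this line would not compile:
    // std::sort(l.begin(), l.end());

    l.sort();  // the member sort: an O(n log n) merge sort in practice

    // a vector, by contrast, sorts fine with the <algorithm> version
    std::vector<int> v(l.begin(), l.end());
    std::sort(v.begin(), v.end());
}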
My curiosity has definitely been aroused. If I take a look into this method performance-wise, is there any interest if I write up here what I find? I mostly looked at it from a small-window optimization angle, only looking at the generated code locally (I don't mean assembly, but the dependencies, symbol use, aliasing and the usual stuff you check only half-consciously when writing code -- that stuff is totally on autopilot for me after so many years ;_)
For a complete list of "caveats", here are the things I know of:
1. TODO: use allocator (compliance issue with std::list)
2. TODO: merge()
3. NOTE: sort() performance issue ;)
4. NOTE: node::pack() and node::unpack() could be written better
5. NOTE: block allocation (no specific comment? ;)
I want to clarify that the point isn't "hey, use this instead!", but rather to take a critical and objective look at the current state-of-the-art std implementations. They are _NOT_ as state-of-the-art as assumed: they can be improved. However, that is not really relevant or interesting, as they are *reasonably* efficient and respect the big-O rules imposed by the specification, which I see as a contract between the client and the implementor.
The thing they have going for them is *correctness*: they have been used in literally thousands of real-world applications, and therefore I would dare say that they are virtually bug-free (in a sense). That is hard to beat, and it is the reason why std::list and std::vector are recommended over minute performance gains.
However, this discussion should be able to demonstrate that there is still room for improvement, should the implementors wish to take it.
For example, the ::reverse() method: if we use "named" pointer fields, we have to actually swap the next and prev pointers in every node of the sequence to reverse it. On the other hand, if we use unnamed pointers and simply toggle a flag saying which one is which, we can reverse the sequence cheaply. In the case of core::list I merely swap the begin and end terminator nodes. That is a counter-example to your case in question, UNLESS the std implementation uses unnamed (or indexed) pointers.
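A sketch of the idea (illustrative, not the actual core::list code):

#include <algorithm>

struct node;  // nodes carry direction-free XOR masks, as sketched above

struct xor_list
{
    node* head_terminator;  // sentinel before the first element
    node* tail_terminator;  // sentinel after the last element

    void reverse()
    {
        // O(1): with mask = prev ^ next there is no per-node direction
        // to fix up; walking from the other terminator simply traverses
        // the same masks in the opposite order.
        std::swap(head_terminator, tail_terminator);
    }
};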
Another point I want to make is that these "containers" are not what this code is really about, and NOT the interesting stuff for me. But I *AM* an optimization freak; when I do something at leisure, I go on and on about it and fiddle, experiment, profile and tinker. When I'm working, well, that's a different story: work gets done, and the main effort goes into readability and ease of maintenance. Performance comes from architecture, not implementation. Only critical stuff is optimized under the microscope, and only after things don't run quite as fast as expected. But it's a totally different world to write *asynchronous* code synchronizing work between many processors in C than generic C++..
One small correction: the tests were made with 100,000 elements, not 10,000 as I said before. In fact, with a smaller number of elements the two implementations seem to have pretty much the same performance, and sometimes yours seems faster.
The std::list<>::sort() was actually fairly easy to defeat. ;-p
Here's how:
static bool kludge_predicate(node* a, node* b)
{
    // order the nodes by the values they carry, not by address
    return a->reference() < b->reference();
}

template <typename ordering>
void sort(ordering order) // <-- not used in this proto version ;(
{
    // gather a pointer to every node into a contiguous vector
    core::vector<node*> nv;
    nv.resize( size() );
    node** ptr = nv;
    iterator i = begin();
    iterator x = end();
    for ( ; i != x; ++i ) *ptr++ = i.node();

    // sort the pointers with the random-access std::sort
    std::sort( nv.begin(), nv.end(), kludge_predicate );

    // rebuild the list in sorted order; the nodes released by clear()
    // are recycled from the pool by the insert() calls
    clear();
    i = begin();
    for ( size_t j = 0; j < nv.size(); ++j )
        i = insert( i, nv[j]->reference() );
}
Cheating! We create a vector of node pointers and initialize each pointer with one from the list container. Then we sort (!) this vector using a predicate that dereferences the pointers, so that the values pointed to are compared.
Last, we assign the sorted nodes back into the list. This is a _very_ lazy implementation, without any of the optimizations that come to mind (except cheating!), and it is about 10-15% faster than the competition. ;)
The insert() loop at the end works OK since the nodes have already been allocated (and freed into the pool by ::clear() ). I think it would be much more efficient to iterate the existing sequence and update the "mask" while iterating, but at least the primary task is complete already. Heh.
This does burn size()*sizeof(node*) bytes of memory for the scope of the sort() method, though. I'll see if I bother to squeeze this little bit further.. thanks for pointing out the room for improvement. ;-p
Note! We also copy the contained objects back into the list BY VALUE; with 128-byte objects we were still 10-15% faster. With "int" objects I won't even bother telling you the difference.
This is the reason why updating the internal node sequence would make a lot of optimization sense. I always have too much on my mind at once to write coherent online posts; usually I write and edit the documents I create until they shape up readable in the end. I suck at posting, etc. :(
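Anyway, a sketch of how that relink pass could look (all names and the mask convention here are assumptions for illustration, not the actual library code):

#include <cstddef>
#include <cstdint>

// Assumed node layout, as in the earlier sketches.
struct node { std::uintptr_t mask; int value; };

static std::uintptr_t addr(const node* n)
{
    return reinterpret_cast<std::uintptr_t>(n);
}

// Rewire the masks of the existing nodes to follow the sorted pointer
// array nv (n > 0), instead of reinserting every value by copy.
void relink_sorted(node* head_term, node* tail_term,
                   node** nv, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        node* prev = (i == 0)     ? head_term : nv[i - 1];
        node* next = (i == n - 1) ? tail_term : nv[i + 1];
        nv[i]->mask = addr(prev) ^ addr(next);  // mask = prev XOR next
    }
    // each terminator sees exactly one real neighbour
    head_term->mask = addr(nv[0]);
    tail_term->mask = addr(nv[n - 1]);
}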