Comparison of STL list performance

29 comments, last by SeraphLance 9 years, 2 months ago

Investigate the ASM output to find out.

I suspect you're actually measuring the time to do 2 million mallocs in your C code compared to ~1 million allocations in the std::list version.


Using a pointer to the data to be held in each list node made all the difference. Now it's as slow as the C90 version.

Test 2: STL list vs. C linked list

C90 g=3000000 passed push,iterate+call,pop in 109,0,47 ms

CPP g=3000000 passed push,iterate+call,pop in 47,16,15 ms

CPP with pointer data g=3000000 passed push,iterate+call,pop in 109,0,47 ms

I'm not sure if I have this exactly right, but I think std::list is not really implemented as a linked list, so your comparison to linked lists makes no sense.

It does act like a doubly-linked list, but under the hood, it uses a little trick to reduce the number of pointer-dereferencing operations which would be required for a straightforward doubly-linked list implementation.

Instead, I think std::list uses multiple buckets (memory buffers which each hold several elements), the same as std::string and other sequence types. Rather than keeping "Next" and "Prev" pointers in each element (or node), it only has to keep track of the memory buckets, and operations on elements inside a bucket don't need the "Next" and "Prev" pointers to move between elements, because those elements are contiguous - they're accessed simply by index.

This is just a simple view. In reality, std::list is much more complicated than this.


Using a pointer to the data to be held in each list node made all the difference. Now it's as slow as the C90 version.

That falls under the "cache-thrashing to cache-friendly" clause, then. The extra indirection has to reach out to different memory, which puts more pressure on the cache and might also be preventing the prefetcher from doing its thing.

This is a good example of why C90 doesn't get you any free mojo -- the way to do a generic linked list in C is with void pointers, and that's likely what it's doing here. In C++, through templates, you can specialize the behavior of a data structure based on properties of its type parameter, while still presenting the same interface to the client code. Here, the STL version was able to either store the value itself inside the node (if the value type is small) or use contiguous blocks of memory. The STL uses "tricks" like this everywhere to provide better general-case performance when the type in question is amenable to it. std::shared_ptr does the same thing -- small data values are stored inside the shared_ptr control block, which saves a pointer indirection and a cache penalty like the one you're seeing in std::list; but if the data value is large, it uses a control block that holds a pointer to an external data value stored somewhere else. This optimization is enabled when you use std::make_shared, and uses type-erasure to be entirely transparent to the programmer.

Behold the power of templates. To achieve similar performance in C, you'd have to write a separate list-node for each small data type you wanted to use, and the whole set of linked-list functions to operate on it.

throw table_exception("(╯°□°)╯︵ ┻━┻");


There's no good reason in 2015 to switch to C from C++. Anyone claiming otherwise is an old-school programmer who hasn't paid much attention to compiler or language technology in the last decade.

Are you talking about game development or general development? Because there are plenty of reasons people use C instead of C++ in 2015; it's just not generally the case in game development, because we handle a lot of information and resources there, and the facilities C++ has over C are very handy for that.

But yeah, people who use C don't do it to outperform std::list, because that would be a complete waste of time. Instead of switching to C, I'd rather spend that time doing what EA did and making my own version of the STL containers, with the features I find superfluous dropped out...


Because there are plenty of reasons people use C instead of C++ in 2015,

And those reasons have always been those reasons -- if you have those reasons, you were never using C++ in the first place, hence no need to switch. C has its place, but I struggle to think of an instance where it makes sense to switch to C from where C++ has been used.

throw table_exception("(╯°□°)╯︵ ┻━┻");

Behold the power of templates. To achieve similar performance in C, you'd have to write a separate list-node for each small data type you wanted to use, and the whole set of linked-list functions to operate on it.

Not necessarily. I can conceive of doing some sort of intrusive container using zero-length arrays. That might seem a little smelly, and you'd have to pass the size of the node type into the node allocator you were using, but I think you could make it work. Totally off the top of my head, no compiler has touched this code so it may contain typos:

[source]
#include <stdlib.h>
#include <string.h>

/* intrusive linked list node - note the zero-length array
   (a GCC extension; C99 has flexible array members, "char data[];") */
typedef struct _IntrusiveNode {
    struct _IntrusiveNode* next;
    char data[0];
} INTRUSIVE_LIST;

/* push a node with room for data_size bytes, left uninitialized */
INTRUSIVE_LIST* push_uninitialized_node(INTRUSIVE_LIST* head, size_t data_size) {
    INTRUSIVE_LIST* new_head;
    new_head = malloc(sizeof(INTRUSIVE_LIST) + data_size);
    new_head->next = head;
    return new_head;
}

/* push a node and copy data_size bytes of data into it */
INTRUSIVE_LIST* push_node(INTRUSIVE_LIST* head, void* data, size_t data_size) {
    INTRUSIVE_LIST* new_head;
    new_head = malloc(sizeof(INTRUSIVE_LIST) + data_size);
    new_head->next = head;
    memcpy(new_head->data, data, data_size);
    return new_head;
}

/* helper macros */
#define INTRUSIVE_LIST_PUSH(h, d) push_node(h, d, sizeof(*(d)))
#define INTRUSIVE_LIST_PUSH_UNINITIALIZED(h, s) push_uninitialized_node(h, sizeof(s))

/* usage */
int main(void) {
    INTRUSIVE_LIST* head;
    float data = 0.0f;

    head = INTRUSIVE_LIST_PUSH_UNINITIALIZED(NULL, int);

    /* look, I can even mix and match data types! */
    head = INTRUSIVE_LIST_PUSH(head, &data);

    return 0;
}
[/source]

Granted, you now have a memory-copy step if you want the node initialized, but you could elide that if you just wanted to emplace an uninitialized node on the list (as I've shown). Oh, and you don't have the type safety of the templated version.

Can anyone find anything wrong with this?

edit: But really, depending on your platform, you shouldn't be using linked lists, anyway.


Instead, I think std::list uses multiple buckets (memory buffers which each hold several elements), the same as std::string and other sequence types. Rather than keeping "Next" and "Prev" pointers in each element (or node), it only has to keep track of the memory buckets, and operations on elements inside a bucket don't need the "Next" and "Prev" pointers to move between elements, because those elements are contiguous - they're accessed simply by index.

I believe you're thinking of the std::deque, which acts somewhat like a vector, but isn't guaranteed to be contiguous, and which guarantees constant complexity for pushing and popping from the front and back. In regards to the std::list, I'm not 100% sure, but I believe Visual Studio's implementation, and most other implementations, use a fairly standard doubly-linked list. Visual Studio has a debugging setting to show the raw structure of objects in the watch window, if you want to see who is right.

I certainly don't mean to contradict any of the good points raised for real applications, but the particular synthetic "benchmark" posted is much much simpler, and can be pretty nicely explained by comparing the running time of the following for-loops.


int i;

for(i=0; i < 1000*1000*10; i++) {
  delete (new int);  /* C++: one allocation per node (value stored inline) */
}

for(i=0; i < 1000*1000*10; i++) {
  free(malloc(4));   /* C: one allocation for the node... */
  free(malloc(4));   /* ...and one for the data it points to */
}

What do you mean, Erik? I think there would be more allocations/deallocations in the C++ version due to the copy constructor (if the compiler doesn't optimize it out). There are also if-statements in pushing, iterating, and popping.

The timings were more shaky this time around, varying by as much as 90 ms. But I tried to get the best of 5 tries.

By the way, zero-length arrays seem to make the C custom linked list implementation as fast as the STL version. Only 1 allocation is required for the linked node together with the data, and one deallocation. Plus maybe no jumping in memory because of a data pointer, allowing it all to be fetched into the cache.

