I think there would be more allocations/deallocations in the c++ version due to copy constructor (if the compiler doesn't compile that out).
In your custom implementation, you first malloc() the 'Widget', and then you malloc a 'ListLink' to store it in, 2 allocations for each node.
std::list will create an internal type that is 'sizeof(Widget) + sizeof(ListLink)' and do one single allocation per node.
That's why you saw the same performance when you changed the std::list store pointers that were allocated with new, as that version also had 2 allocations per node with that setup, 'new Widget' and new 'sizeof(pointer) + sizeof(ListLink)'.
Try timing the loop with 10 million allocations only without any list, and I think you will find that it's not too far from the time of the allocation-step in your list-test.
If you want to count the number of allocations for a C++ implementation, you can override operator new and increment a counter before calling malloc from it.
int g_countNew = 0;
int g_countDelete = 0;
void* operator new(size_t size) {
++g_countNew;
return malloc(size);
}
void operator delete(void *mem) {
++g_countDelete;
free(mem);
}