Custom Allocator Efficiency Comparison

Started by
3 comments, last by Hodgman 11 years, 12 months ago
Hey Guys,

I made a custom allocator (based on Doug Lee allocator) and a utility library for C++.
When I use the custom allocator I am not getting a lot of difference when compared to the windows default malloc and free.

This is my result on the release version of the program run from console.


########Windows Allocator########
Sample 0 Time: 56.000000 ms
Sample 1 Time: 52.000000 ms
Sample 2 Time: 46.000000 ms
Sample 3 Time: 55.000000 ms
Sample 4 Time: 54.000000 ms
Sample 5 Time: 52.000000 ms
Sample 6 Time: 46.000000 ms
Sample 7 Time: 54.000000 ms
Sample 8 Time: 51.000000 ms
Sample 9 Time: 46.000000 ms
Sample 10 Time: 58.000000 ms
Sample 11 Time: 46.000000 ms
Sample 12 Time: 48.000000 ms
Sample 13 Time: 46.000000 ms
Sample 14 Time: 50.000000 ms
Sample 15 Time: 55.000000 ms
Sample 16 Time: 56.000000 ms
Sample 17 Time: 45.000000 ms
Sample 18 Time: 47.000000 ms
Sample 19 Time: 45.000000 ms
Sample 20 Time: 47.000000 ms
Sample 21 Time: 56.000000 ms
Sample 22 Time: 47.000000 ms
Sample 23 Time: 45.000000 ms
Sample 24 Time: 46.000000 ms
Sample 25 Time: 57.000000 ms
Sample 26 Time: 57.000000 ms
Sample 27 Time: 49.000000 ms
Sample 28 Time: 56.000000 ms
Sample 29 Time: 46.000000 ms


########Custom Allocator########
Sample 0 Time: 47.000000 ms
Sample 1 Time: 44.000000 ms
Sample 2 Time: 39.000000 ms
Sample 3 Time: 40.000000 ms
Sample 4 Time: 38.000000 ms
Sample 5 Time: 39.000000 ms
Sample 6 Time: 38.000000 ms
Sample 7 Time: 40.000000 ms
Sample 8 Time: 40.000000 ms
Sample 9 Time: 38.000000 ms
Sample 10 Time: 42.000000 ms
Sample 11 Time: 39.000000 ms
Sample 12 Time: 38.000000 ms
Sample 13 Time: 45.000000 ms
Sample 14 Time: 44.000000 ms
Sample 15 Time: 40.000000 ms
Sample 16 Time: 37.000000 ms
Sample 17 Time: 38.000000 ms
Sample 18 Time: 45.000000 ms
Sample 19 Time: 43.000000 ms
Sample 20 Time: 46.000000 ms
Sample 21 Time: 44.000000 ms
Sample 22 Time: 45.000000 ms
Sample 23 Time: 37.000000 ms
Sample 24 Time: 45.000000 ms
Sample 25 Time: 42.000000 ms
Sample 26 Time: 39.000000 ms
Sample 27 Time: 41.000000 ms
Sample 28 Time: 38.000000 ms
Sample 29 Time: 42.000000 ms

For the custom allocator
allocation - O(number of free chunks)
delete - O(1)

Does any one else have any information regarding general statistics for custom allocators they have used as compared to the native one ?
Any information is greatly appreciated..

I included the test file for reference.

Thanks,
Jerry
JerrinX
Advertisement
I would expect Microsoft/GCC/etc.'s malloc to perform almost as well as something like dlmalloc. Actually, I wouldn't be surprised if they were actually using dlmalloc themselves!

Not having your allocator/list code, I obviously can't run the same test myself, but I copied your code roughly to these results:[source]########Standard List########
Sample 0 Time: 174.658611 ms
Sample 1 Time: 178.035155 ms
Sample 2 Time: 170.639098 ms
Sample 3 Time: 169.447884 ms
Sample 4 Time: 171.300843 ms
Sample 5 Time: 171.426281 ms
Sample 6 Time: 170.957386 ms
Sample 7 Time: 171.255186 ms
Sample 8 Time: 171.174973 ms
Sample 9 Time: 171.274394 ms
Sample 10 Time: 172.276601 ms
Sample 11 Time: 173.462272 ms
Sample 12 Time: 172.348276 ms
Sample 13 Time: 172.015920 ms
Sample 14 Time: 171.314925 ms
Sample 15 Time: 170.047760 ms
Sample 16 Time: 172.620490 ms
Sample 17 Time: 171.961725 ms
Sample 18 Time: 170.538411 ms
Sample 19 Time: 172.772378 ms
Sample 20 Time: 171.157062 ms
Sample 21 Time: 170.712486 ms
Sample 22 Time: 171.153218 ms
Sample 23 Time: 171.347350 ms
Sample 24 Time: 171.799183 ms
Sample 25 Time: 170.436874 ms
Sample 26 Time: 169.666752 ms
Sample 27 Time: 170.434311 ms
Sample 28 Time: 171.126768 ms
Sample 29 Time: 169.729471 ms

########Malloc/Free########
Sample 0 Time: 72.385743 ms
Sample 1 Time: 77.789746 ms
Sample 2 Time: 78.608923 ms
Sample 3 Time: 76.863475 ms
Sample 4 Time: 73.470719 ms
Sample 5 Time: 74.614152 ms
Sample 6 Time: 73.470294 ms
Sample 7 Time: 73.288970 ms
Sample 8 Time: 74.126065 ms
Sample 9 Time: 73.396482 ms
Sample 10 Time: 74.125208 ms
Sample 11 Time: 73.187426 ms
Sample 12 Time: 73.719032 ms
Sample 13 Time: 72.559386 ms
Sample 14 Time: 75.735405 ms
Sample 15 Time: 78.938723 ms
Sample 16 Time: 78.135759 ms
Sample 17 Time: 78.236453 ms
Sample 18 Time: 79.052642 ms
Sample 19 Time: 75.144060 ms
Sample 20 Time: 73.745914 ms
Sample 21 Time: 73.770232 ms
Sample 22 Time: 74.688390 ms
Sample 23 Time: 74.539915 ms
Sample 24 Time: 75.108223 ms
Sample 25 Time: 74.743003 ms
Sample 26 Time: 74.312508 ms
Sample 27 Time: 73.611088 ms
Sample 28 Time: 73.832951 ms
Sample 29 Time: 104.177885 ms

########New/Delete########
Sample 0 Time: 106.053881 ms
Sample 1 Time: 81.357852 ms
Sample 2 Time: 84.248006 ms
Sample 3 Time: 79.414867 ms
Sample 4 Time: 77.739395 ms
Sample 5 Time: 77.777795 ms
Sample 6 Time: 78.748859 ms
Sample 7 Time: 80.961064 ms
Sample 8 Time: 78.002214 ms
Sample 9 Time: 77.532895 ms
Sample 10 Time: 77.698864 ms
Sample 11 Time: 79.061598 ms
Sample 12 Time: 81.683390 ms
Sample 13 Time: 80.320232 ms
Sample 14 Time: 80.953382 ms
Sample 15 Time: 80.176875 ms
Sample 16 Time: 80.615900 ms
Sample 17 Time: 78.425460 ms
Sample 18 Time: 79.089329 ms
Sample 19 Time: 78.178428 ms
Sample 20 Time: 78.876004 ms
Sample 21 Time: 76.929182 ms
Sample 22 Time: 78.069203 ms
Sample 23 Time: 77.831559 ms
Sample 24 Time: 77.523082 ms
Sample 25 Time: 77.952728 ms
Sample 26 Time: 80.115438 ms
Sample 27 Time: 78.923792 ms
Sample 28 Time: 78.023121 ms
Sample 29 Time: 78.078590 ms

########Custom New########
Sample 0 Time: 57.033867 ms
Sample 1 Time: 57.575718 ms
Sample 2 Time: 57.030451 ms
Sample 3 Time: 58.402147 ms
Sample 4 Time: 59.323721 ms
Sample 5 Time: 59.077967 ms
Sample 6 Time: 58.853120 ms
Sample 7 Time: 59.006289 ms
Sample 8 Time: 57.555236 ms
Sample 9 Time: 58.053996 ms
Sample 10 Time: 56.847844 ms
Sample 11 Time: 56.901176 ms
Sample 12 Time: 57.571024 ms
Sample 13 Time: 57.548411 ms
Sample 14 Time: 56.950241 ms
Sample 15 Time: 57.026185 ms
Sample 16 Time: 56.919523 ms
Sample 17 Time: 56.417350 ms
Sample 18 Time: 57.097863 ms
Sample 19 Time: 57.543717 ms
Sample 20 Time: 57.553958 ms
Sample 21 Time: 61.106279 ms
Sample 22 Time: 61.448030 ms
Sample 23 Time: 60.788851 ms
Sample 24 Time: 62.251419 ms
Sample 25 Time: 64.147048 ms
Sample 26 Time: 57.912774 ms
Sample 27 Time: 57.066292 ms
Sample 28 Time: 56.633238 ms
Sample 29 Time: 60.453072 ms

########Custom Malloc########
Sample 0 Time: 25.684588 ms
Sample 1 Time: 25.197776 ms
Sample 2 Time: 24.875225 ms
Sample 3 Time: 24.606859 ms
Sample 4 Time: 24.524942 ms
Sample 5 Time: 24.430651 ms
Sample 6 Time: 24.728457 ms
Sample 7 Time: 24.803974 ms
Sample 8 Time: 24.011251 ms
Sample 9 Time: 24.533048 ms
Sample 10 Time: 24.780935 ms
Sample 11 Time: 24.725469 ms
Sample 12 Time: 24.698591 ms
Sample 13 Time: 24.162285 ms
Sample 14 Time: 24.635872 ms
Sample 15 Time: 24.655499 ms
Sample 16 Time: 25.093244 ms
Sample 17 Time: 24.769414 ms
Sample 18 Time: 24.684938 ms
Sample 19 Time: 24.730589 ms
Sample 20 Time: 26.535336 ms
Sample 21 Time: 25.872743 ms
Sample 22 Time: 25.338145 ms
Sample 23 Time: 24.931969 ms
Sample 24 Time: 24.574434 ms
Sample 25 Time: 24.454543 ms
Sample 26 Time: 24.661472 ms
Sample 27 Time: 24.692191 ms
Sample 28 Time: 24.640139 ms
Sample 29 Time: 24.202392 ms[/source][source lang=cpp]StackAlloc alloc(scratch, ((u8*)scratch)+scratchSize);
Scope a(alloc, "");
Timer* timer = eiNew(a, Timer)();
printf("\n\n########Standard List########");
for(int itrSample = 0; itrSample < 30; itrSample++)
{
double begin = timer->Elapsed();
for(int itrItr = 0; itrItr < 100; itrItr++)
{
std::list<int> lst;
for(int itr = 0; itr < 5000; itr++)
{
lst.push_front(itr);
}
}
double end = timer->Elapsed();
printf("\nSample %d Time: %f ms", itrSample, float(end - begin)*1000.0f);
}

struct ListItem
{
int data;
ListItem* next;
};

printf("\n\n########Malloc/Free########");
for(int itrSample = 0; itrSample < 30; itrSample++)
{
double begin = timer->Elapsed();
for(int itrItr = 0; itrItr < 100; itrItr++)
{
ListItem* lst = 0;
for(int itr = 0; itr < 5000; itr++)
{
ListItem* push = (ListItem*)malloc(sizeof(ListItem));
push->next = lst;
push->data = itr;
lst = push;
}

while(lst)
{
ListItem* dead = lst;
lst = dead->next;
free(dead);
}
}
double end = timer->Elapsed();
printf("\nSample %d Time: %f ms", itrSample, float(end - begin)*1000.0f);
}

printf("\n\n########New/Delete########");
for(int itrSample = 0; itrSample < 30; itrSample++)
{
double begin = timer->Elapsed();
for(int itrItr = 0; itrItr < 100; itrItr++)
{
ListItem* lst = 0;
for(int itr = 0; itr < 5000; itr++)
{
ListItem* push = new ListItem;
push->next = lst;
push->data = itr;
lst = push;
}

while(lst)
{
ListItem* dead = lst;
lst = dead->next;
delete dead;
}
}
double end = timer->Elapsed();
printf("\nSample %d Time: %f ms", itrSample, float(end - begin)*1000.0f);
}

printf("\n\n########Custom New########");
for(int itrSample = 0; itrSample < 30; itrSample++)
{
double begin = timer->Elapsed();
for(int itrItr = 0; itrItr < 100; itrItr++)
{
Scope t(a,"");
ListItem* lst = 0;
for(int itr = 0; itr < 5000; itr++)
{
ListItem* push = eiNew(t, ListItem);
push->next = lst;
push->data = itr;
lst = push;
}
}
double end = timer->Elapsed();
printf("\nSample %d Time: %f ms", itrSample, float(end - begin)*1000.0f);
}

printf("\n\n########Custom Malloc########");
for(int itrSample = 0; itrSample < 30; itrSample++)
{
double begin = timer->Elapsed();
for(int itrItr = 0; itrItr < 100; itrItr++)
{
Scope t(a,"");
ListItem* lst = 0;
for(int itr = 0; itr < 5000; itr++)
{
ListItem* push = eiAlloc(t, ListItem);
push->next = lst;
push->data = itr;
lst = push;
}
}
double end = timer->Elapsed();
printf("\nSample %d Time: %f ms", itrSample, float(end - begin)*1000.0f);
}[/source]
The general-purpose allocators in the wild are extremely hard to improve upon, in my experience. The Windows NT low-fragmentation heap allocator in particular is insanely good at its job.

Where custom allocators come in handy is in a couple of niche situations:

- You're on a platform with no (or bad) default allocators, such as a console
- You're not using general-purpose allocation patterns


In the first case, custom libraries like dlmalloc are often deployed in certain configurations because the alternatives are terrible. If you're programming against Windows (and to a lesser extent most Linux/BSD flavors) you have no need to worry about general-purpose allocation.

The second case is IMHO the more interesting one. For instance, if you know your allocation pattern looks like "allocate ten million tiny things then throw them all away at once" you can write a custom allocator that'll be an order of magnitude faster than a general-purpose allocator easily. Stack allocators are often useful, one-way allocators as mentioned, and so on. There are a handful of such patterns that can be handy to tap if you know what to look for.


There's one other reason to screw with allocators in my book, and that's instrumentation. It can be extremely handy in a large, complex code base to get comprehensive dumps of who's using what memory. Thankfully, there are increasingly good tools for this on most platforms these days, so rolling your own is rarely necessary any more.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Wow !! Thanks a lot Hodgman and ApochiPiQ for that useful update !!
Really awesome you guys replied so fast !!

Hey Hodgman ur getting some pretty impressive results.

From the code you posted I am guessing it is a stack allocator ? Am I right ?
I think it is better to compare the windows allocator to a pool based one though.
Also are you running on release on console ?

I attached the allocator here. Check out if you have time.
I recommend using a fragment size of 4 Bytes on 32 Bit machine

Thanks again !
Jerry
JerrinX
From the code you posted I am guessing it is a stack allocator ? Am I right ?
I think it is better to compare the windows allocator to a pool based one though.
Also are you running on release on console ?
Yep that's a stack allocator (the lifetime of the allocations are tied to the lifetime of the [font=courier new,courier,monospace]Scope t[/font] variable) so it's not really a fair comparison, but then again, the almost my entire engine is allocated from stack allocators... wink.png Both my "custom new" and "custom malloc" tests are using the stack allocator, but the speed difference is that the first test also allocates a linked-list of destructor callbacks, whereas the second test skips all the data required to call destructors correctly when the scope object dies.

I tried using my fixed-size pool allocator in the same test (again, not quite a fair comparison with a general-purpose allocator) and got even faster results than my stack allocator though![source]########Pool########
Sample 0 Time: 18.695643 ms
Sample 1 Time: 18.929025 ms
Sample 2 Time: 18.866306 ms
Sample 3 Time: 19.606981 ms
Sample 4 Time: 19.088168 ms
Sample 5 Time: 18.805722 ms
Sample 6 Time: 18.731056 ms
Sample 7 Time: 18.869294 ms
Sample 8 Time: 18.873986 ms
Sample 9 Time: 19.106515 ms
Sample 10 Time: 19.028010 ms
Sample 11 Time: 19.021183 ms
Sample 12 Time: 18.926892 ms
Sample 13 Time: 19.299362 ms
Sample 14 Time: 19.043369 ms
Sample 15 Time: 18.762629 ms
Sample 16 Time: 18.997716 ms
Sample 17 Time: 18.873986 ms
Sample 18 Time: 18.857347 ms
Sample 19 Time: 18.777136 ms
Sample 20 Time: 19.036969 ms
Sample 21 Time: 18.784815 ms
Sample 22 Time: 19.121021 ms
Sample 23 Time: 19.129980 ms
Sample 24 Time: 18.948652 ms
Sample 25 Time: 19.180752 ms
Sample 26 Time: 18.912386 ms
Sample 27 Time: 19.054888 ms
Sample 28 Time: 18.936278 ms
Sample 29 Time: 19.156005 ms[/source]
I attached the allocator here. Check out if you have time.[/quote]I tried to use it, but I'm using a different compiler than you -- the lib file was incompatible.

I've uploaded the source for my stack, pool and scope classes if you're interested.

This topic is closed to new replies.

Advertisement