I would strongly encourage anyone on this thread to read the paper I wrote on this topic over a decade ago. It just won a Most Influential Paper award but its influence has clearly not spread to this domain...yet.
TL;DR: a good malloc is often as fast as or faster than your custom allocator because it uses the same tricks; "region" allocators can be faster, but they can leak tons of memory.
Title: Reconsidering Custom Memory Allocation (ACM link, direct PDF link, PowerPoint talk slides), OOPSLA 2002. I've attached the slides in PPT and PDF formats; I highly recommend looking at the PPT version, since it has animations that do not translate well to PDF.
Abstract:
Programmers hoping to achieve performance improvements often use custom memory allocators. This in-depth study examines eight applications that use custom allocators. Surprisingly, for six of these applications, a state-of-the-art general-purpose allocator (the Lea allocator) performs as well as or better than the custom allocators. The two exceptions use regions, which deliver higher performance (improvements of up to 44%). Regions also reduce programmer burden and eliminate a source of memory leaks. However, we show that the inability of programmers to free individual objects within regions can lead to a substantial increase in memory consumption. Worse, this limitation precludes the use of regions for common programming idioms, reducing their usefulness.

We present a generalization of general-purpose and region-based allocators that we call reaps. Reaps are a combination of regions and heaps, providing a full range of region semantics with the addition of individual object deletion. We show that our implementation of reaps provides high performance, outperforming other allocators with region-like semantics. We then use a case study to demonstrate the space advantages and software engineering benefits of reaps in practice. Our results indicate that programmers needing fast regions should use reaps, and that most programmers considering custom allocators should instead use the Lea allocator.
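To make the region trade-off concrete, here is a minimal sketch in C. It is not code from the paper: the names `Region`, `region_init`, `region_alloc`, `region_destroy`, and `region_churn_bytes` are hypothetical, and the single fixed-size chunk and 16-byte alignment are simplifying assumptions. Allocation is just a pointer bump, which is where the speed comes from, but there is no way to free one object, so short-lived objects keep occupying memory until the whole region is destroyed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define REGION_ALIGN 16  /* assumed alignment, enough for common types */

/* Hypothetical minimal region: one fixed-size chunk, bump-pointer allocation. */
typedef struct {
    unsigned char *base;  /* start of the backing chunk */
    size_t capacity;      /* chunk size in bytes */
    size_t used;          /* bytes handed out so far */
} Region;

/* Returns 0 on success, -1 if the backing chunk cannot be allocated. */
static int region_init(Region *r, size_t capacity) {
    r->base = malloc(capacity);
    r->capacity = capacity;
    r->used = 0;
    return r->base ? 0 : -1;
}

/* Bump-pointer allocation: no headers, no free lists, no per-object free. */
static void *region_alloc(Region *r, size_t size) {
    size_t rounded = (size + REGION_ALIGN - 1) & ~(size_t)(REGION_ALIGN - 1);
    if (rounded < size || rounded > r->capacity - r->used) {
        return NULL;  /* overflow or out of space */
    }
    void *p = r->base + r->used;
    r->used += rounded;
    return p;
}

/* The only way to give memory back: release everything at once. */
static void region_destroy(Region *r) {
    free(r->base);
    r->base = NULL;
    r->capacity = 0;
    r->used = 0;
}

/* Allocates `count` temporaries that are each dead right after use and
 * returns how many bytes the region still holds at the end. */
static size_t region_churn_bytes(size_t count, size_t size) {
    Region r;
    size_t rounded = (size + REGION_ALIGN - 1) & ~(size_t)(REGION_ALIGN - 1);
    if (size == 0 || region_init(&r, count * rounded) != 0) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        unsigned char *tmp = region_alloc(&r, size);
        assert(tmp != NULL);
        tmp[0] = (unsigned char)i;  /* use the object; it is garbage afterwards */
    }
    size_t held = r.used;  /* dead objects are still counted here */
    region_destroy(&r);
    return held;
}
```

Calling `region_churn_bytes(1000, 64)` returns 64000: the region holds all 1000 dead objects at once, whereas a malloc/free loop over the same temporaries would need only one 64-byte object live at a time. Reaps, as described in the abstract, address this by adding individual object deletion to region semantics; this sketch does not attempt that.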
StackOverflow discussion here.