I fiddled with this problem about a year ago, and the gain was around 20x over std new/delete in speed (and about 5x smaller memory consumption with std::list&lt;void*&gt;) in single-threaded profiling. It wasn't meant for generic use but for use with multithreading. I never finished it, but I'll share some experience..
First you have the data, which you will want to divide into fixed-size pages. I had aligned pages of 64k elements, with element sizes of 4, 16 and 64 bytes.
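Such a page can be sketched roughly like this (the names and the template parameter are my own illustration, not from the original code):

```cpp
#include <cstddef>
#include <cstdint>

// Rough sketch of a fixed-size data page: 64k elements, with the element
// size (4, 16 or 64 bytes above) as a compile-time parameter. In practice
// the page would also be allocated with suitable alignment.
template <std::size_t ElementSize>
struct page
{
    static const std::size_t element_count = 64 * 1024;

    // Raw storage only; which slots are live is tracked separately by the
    // free index lists, never inside the page itself.
    unsigned char data[element_count * ElementSize];

    void* element_at(std::uint32_t index)
    {
        return data + std::size_t(index) * ElementSize;
    }
};
```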
For each page you need to track which elements are used and which are free; I call this the free index list. You want to keep the index list separate from the data it points to, so that scanning for free slots doesn't pull the data itself into cache.
The control data and the actual free index list contents were also separated, for better cache usage. By control data I mean a fast way to tell the free count of a free index list (i.e. of a block of memory). I had a collection of these index lists and a separate collection of their control data, so when you iterate the collection, multiple entries sit in one cache line.
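As a sketch of that split (illustrative names, and assuming one byte of control data per list):

```cpp
#include <cstdint>

// The per-list control data (here just a free count) lives in its own dense
// array, parallel to the 32-bit lists. Scanning for a list with free slots
// touches only the counts, so one 64-byte cache line covers 64 of them.
struct index_list_directory
{
    static const std::uint32_t list_count = 2048; // 2048 * 32 = 64k elements

    std::uint8_t  free_count[list_count]; // control data, scanned first
    std::uint32_t bits[list_count];       // the actual 32-element free lists

    // Returns the first list with a free slot, or list_count if all full.
    std::uint32_t find_list_with_space() const
    {
        for (std::uint32_t i = 0; i != list_count; ++i)
            if (free_count[i] != 0)
                return i;
        return list_count;
    }
};
```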
You want to divide these pages into collections of full, empty and partially used pages. When you allocate and search for free space you can skip the full pages, and when you deallocate you can skip the empty ones. Also, don't sync the page collections on every deallocation: most probably you're in a std::list destructor and deallocating a lot in a row.
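That bookkeeping might look something like this (my own sketch; plain ints stand in for page pointers to keep it short):

```cpp
#include <vector>

// Pages migrate between three collections. Allocation never has to look at
// full pages, and deallocation never at empty ones.
struct page_sets
{
    std::vector<int> full, partial, empty_pages;

    // Pick a page to allocate from: prefer partial pages, then empty ones.
    // Returns -1 when a fresh page must be requested from the OS.
    int page_for_alloc()
    {
        if (!partial.empty())
            return partial.back();
        if (!empty_pages.empty())
        {
            int const p = empty_pages.back();
            empty_pages.pop_back();
            partial.push_back(p); // about to receive its first allocation
            return p;
        }
        return -1;
    }
};
```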
The data pages come in different element sizes to reduce the overhead of the free index lists: for larger allocations you use pages with larger elements.
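The mapping from request size to page kind could be as simple as this (the three classes are from above; the fallback for larger requests is my assumption, the post doesn't say what happens there):

```cpp
// Maps an allocation size to one of the three element sizes used above.
// Requests above 64 bytes are assumed to fall back to the general-purpose
// allocator.
inline unsigned size_class(unsigned bytes)
{
    if (bytes <= 4)  return 4;
    if (bytes <= 16) return 16;
    if (bytes <= 64) return 64;
    return 0; // no page class; use plain new/delete
}
```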
The free index lists are implemented by specialized, hand-optimized lists of varying sizes, all built on the code below. This is a 32-element singly linked list which can only hold the integers 0 - 31. It is branchless and always ordered, and the whole list fits in a 32-bit register, so you can iterate it without any memory operations. Includes are not included..
```cpp
#include "popcount.h"
#include <cassert>

#define PT_INLINE __forceinline
#define PT_ASSUME(e) (__assume(e))

typedef unsigned char  uint8;
typedef unsigned short uint16;
typedef unsigned int   uint32;
typedef int            int32; // was missing, but is used below

/// List of integers, fixed capacity of 32 elements.
///
/// Consumes 4 bytes of memory, each node is one bit. Whole list fits in a 32
/// bit register to enable compiler to apply heavy optimizations.
///
/// All member functions are branchless, no if's or loops.
/// Inlines to a few machine instructions, e.g. the constructor is a single OR
/// or XOR instruction. There are no, or very few, immediate constant loads.
///
class fixed_int_list_32
{
public:
    typedef uint32 value_type;
    enum filled_type { filled };
    class iterator;

    /// Constructs new empty list.
    fixed_int_list_32();
    /// Constructs new list filled with indices 0 to 31.
    fixed_int_list_32(filled_type);

    bool empty() const;
    bool full() const;
    uint8 size() const;

    iterator begin();
    iterator end();
    iterator iterator_at(uint32 index);

    void insert(uint32 index);
    void insert_2(value_type start_index);
    void insert_4(value_type start_index);

    /// Returns true if next 2 elements from index are available.
    bool next_2_available(value_type index) const;
    /// Returns true if next 4 elements from index are available.
    bool next_4_available(value_type index) const;

    /// Returns iterator which is greater than or equal to \a index.
    iterator find_ge(uint32 index) const;

    iterator erase(uint32 index);
    iterator erase(iterator it);
    void pop_front();
    void pop_back();

private: // data members
    uint32 m_data;
};

class fixed_int_list_32::iterator
{
public:
    typedef uint32 value_type;

    iterator(uint32 index, uint32 const& data)
        : m_data(&data)
        , m_index(index)
    {
    }

    value_type operator*() const
    {
        return static_cast<value_type>(m_index);
    }

    iterator& operator++()
    {
        // Incrementing the end iterator (index 32) is not supported.
        uint32 const data = *m_data;
        // Shift out the current bit and everything below it. Done as two
        // shifts because "data >> (m_index + 1)" is undefined (and on x86 a
        // no-op, since the shift count is taken mod 32) when m_index is 31.
        uint32 const shifted_data = (data >> m_index) >> 1u;
        // Count zeros to skip.
        uint32 const skip_count = trailing_zero_count(shifted_data);
        // m_index = (skip_count < 32) ? m_index + 1 + skip_count : 32;
        uint32 const next_index = m_index + 1u + skip_count;
        int32 const a = 31 - skip_count;
        // index = (a >= 0) ? next_index : 32;
        m_index = select(a, next_index, 32);
        return *this;
    }

    bool operator!=(iterator const& other) const
    {
        return m_index != other.m_index || m_data != other.m_data;
    }

    bool operator==(iterator const& other) const
    {
        return m_index == other.m_index && m_data == other.m_data;
    }

    bool is_end() const
    {
        return m_index == 32u;
    }

private: // data members
    uint32 const* __restrict m_data;
    uint32 m_index;
};

// .inl

inline fixed_int_list_32::fixed_int_list_32()
    : m_data(0)
{
}

inline fixed_int_list_32::fixed_int_list_32(filled_type)
    : m_data(~static_cast<uint32>(0))
{
}

inline bool fixed_int_list_32::empty() const
{
    return m_data == 0;
}

inline bool fixed_int_list_32::full() const
{
    return m_data == ~static_cast<uint32>(0);
}

inline uint8 fixed_int_list_32::size() const
{
    return popcount_wiki_3(m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::begin()
{
    uint32 const first_bit_index = trailing_zero_count(m_data);
    return iterator(first_bit_index, m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::end()
{
    return iterator(32, m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::iterator_at(uint32 index)
{
    return iterator(index, m_data);
}

inline void fixed_int_list_32::insert(uint32 index)
{
    PT_ASSUME(index < 32);
    m_data |= 1u << index; // set bit
}

inline void fixed_int_list_32::insert_2(value_type start_index)
{
    PT_ASSUME(start_index < 31);
    m_data |= 3u << start_index;
}

inline void fixed_int_list_32::insert_4(value_type start_index)
{
    PT_ASSUME(start_index < 29);
    m_data |= 0x0fu << start_index;
}

inline bool fixed_int_list_32::next_2_available(value_type index) const
{
    // return true if the next two bits are set
    return ((m_data >> index) & 3u) == 3u;
}

inline bool fixed_int_list_32::next_4_available(value_type index) const
{
    // return true if the next four bits are set
    return ((m_data >> index) & 15u) == 15u;
}

/// Returns iterator which is greater than or equal to \a index.
inline fixed_int_list_32::iterator fixed_int_list_32::find_ge(uint32 index) const
{
    // index == 0 would break the mask below; internal callers always pass >= 1.
    assert(index >= 1 && index <= 32);
    PT_ASSUME(index >= 1 && index <= 32);
    // On x86 "x shr/shl 32" == "x" (shift count is taken mod 32), so for
    // index == 32 the surviving bits of m_data >> index must be masked out.
    uint32 const mask = 0xffffffffu >> (index - 1u);
    uint32 const masked_bits = (m_data >> index) & mask;
    uint32 const first_bit_index = trailing_zero_count(masked_bits);
    int32 const next_index = index + first_bit_index;
    uint32 const clamped_next_index = branchless_min(next_index, int32(32));
    return iterator(clamped_next_index, m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::erase(uint32 index)
{
    PT_ASSUME(index < 32);
    fixed_int_list_32::iterator const next = find_ge(index + 1);
    m_data &= ~(1u << index); // clear bit
    return next;
}

inline fixed_int_list_32::iterator fixed_int_list_32::erase(iterator it)
{
    uint32 const index = *it;
    ++it;
    m_data &= ~(1u << index); // clear bit
    return it;
}

inline void fixed_int_list_32::pop_front()
{
    uint32 const first_bit_index = trailing_zero_count(m_data);
    PT_ASSUME(first_bit_index < 32);
    m_data &= ~(1u << first_bit_index); // clear bit
}

inline void fixed_int_list_32::pop_back()
{
    uint32 const last_bit_index = 31 - leading_zero_count(m_data);
    PT_ASSUME(last_bit_index < 32);
    m_data &= ~(1u << last_bit_index); // clear bit
}
```
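The helpers from "popcount.h" aren't shown above. For reference, here is one portable way to write them with GCC/Clang builtins (MSVC would use _BitScanForward/_BitScanReverse and __popcnt instead); their semantics are inferred from how the list code calls them, so treat this as a sketch:

```cpp
#include <cstdint>

// trailing/leading zero count must return 32 for a zero input (the list code
// relies on that to produce the end iterator).
inline std::uint32_t trailing_zero_count(std::uint32_t x)
{
    return x ? std::uint32_t(__builtin_ctz(x)) : 32u;
}

inline std::uint32_t leading_zero_count(std::uint32_t x)
{
    return x ? std::uint32_t(__builtin_clz(x)) : 32u;
}

inline std::uint32_t popcount_wiki_3(std::uint32_t x)
{
    return std::uint32_t(__builtin_popcount(x));
}

// select(a, x, y): x if a >= 0, else y. Branchless via a sign mask; relies on
// arithmetic right shift of negative ints, which is implementation-defined
// but universal on mainstream compilers.
inline std::uint32_t select(std::int32_t a, std::uint32_t x, std::uint32_t y)
{
    std::uint32_t const mask = std::uint32_t(a >> 31); // all ones iff a < 0
    return (x & ~mask) | (y & mask);
}

// Branchless min; assumes a - b does not overflow.
inline std::int32_t branchless_min(std::int32_t a, std::int32_t b)
{
    std::int32_t const d = a - b;
    return b + (d & (d >> 31)); // adds d only when d < 0
}
```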
Given a 64-byte cache line, 16 of these 4-byte lists fit in one line - 512 indices per cache line.
Oh, and then there is memory fragmentation. It's a complex issue, and one where I still feel like a beginner.