Cache misses when using a memory pool

Quote:Original post by Joni-Matti
I've always understood that it decreases performance. In my case, using timeBeginPeriod(1) improves performance, which I don't understand. Here are some numbers from one of my benchmarks for the memory pool version:

timeGetTime: 810 ms
QPC: 807 ms
timeGetTime+timeBeginPeriod(1): 602 ms
QPC+timeBeginPeriod(1): 606 ms

QPC+timeBeginPeriod(1) is just for reference. I know that timeBeginPeriod(1) doesn't have any effect on QPC, but it was quite weird that the program execution time still improved. This is the first time I've seen this happen; I've used timeGetTime() and QPC for a long time and it has never occurred before. Oh, and even adding cin.get(); to the end of main() caused performance to drop (from 607 ms to 813 ms).

It would be good to know which timer or other profiling mechanism to trust, so that I can concentrate on the cache problem.

timeBeginPeriod() *globally* sets the system quantum so it also affects all other applications running at the moment. This is so abhorrent I wonder why it's allowed!

I am not sure if the issue was solved already, but I'd check whether there is a memory alignment issue here.

As far as I know, new char[...] doesn't necessarily align in the best way.

Cheers!
Quote:Original post by Antheus
Ignoring the fact that the test seems to be wonky in the first place, I would design such an allocator differently:
<code snipped removed>


I tried your allocator and it improved performance. Now the memory pool version was slightly better than the new/delete version. However, the memory usage was too high. The reason for both the better performance and the larger memory usage was these lines of code:

size_t byte_size = count * sizeof(T);
size_t page_size = (byte_size / cache_line_size + 1) * cache_line_size;
pool_size = page_size * count;


Could you explain more thoroughly how this code works? If count = 32, sizeof(T) = 1024 and cache_line_size = 64, then:
byte_size = 32768
page_size = 32832
pool_size = 1050624

So instead of 32 kB of memory usage, the pool consumes ~1MB. This code will eventually be run in embedded devices with limited memory resources, so wasting so much memory is not acceptable.

If I removed the chunk cache line optimization which you did with the page_size variable, the performance was the same as with my memory pool.

Quote:
Chunks in use are managed via a stack. There isn't much reason to use linked lists. This also results in a somewhat safer design, since data manipulated by the user will never be overwritten by the allocator, so it's safe even if an invalid pointer is passed into free().


Why should a stack be preferred over a linked list? It uses more memory (although in this case not much). I would understand it if the linked list were formed from memory other than the actual buffers. The pointer validation check inside free() can be done just as easily in my version as in yours, so how is the stack version any safer? (Usually I include that check, but this time I omitted it from the memory pool.)

#include <cassert>
#include <cstddef>

class MemoryPool
{
public:
    // Constructor which initializes the memory pool.
    MemoryPool(unsigned int size, unsigned int count);

    // Destructor.
    ~MemoryPool();

    // Allocates one buffer from the pool.
    inline unsigned char* allocate();

    // Frees the given buffer back to the pool. Returns false if the pointer
    // does not belong to the pool.
    inline bool free(unsigned char* buf);

private:
    // Buffer node structure for storing the buffers.
    struct BufferNode
    {
        BufferNode* next;
    };

    // The actual memory allocation block.
    unsigned char* m_allocation;
    unsigned char* m_allocationEnd;

    // Free buffers list.
    BufferNode* m_freeBuffers;
};

MemoryPool::MemoryPool(unsigned int size, unsigned int count)
    : m_allocation(NULL), m_freeBuffers(NULL)
{
    // Allocate memory.
    unsigned int allocSize = size * count;
    m_allocation = new unsigned char[allocSize];
    m_allocationEnd = m_allocation + allocSize;

    // Split it into buffers and form the linked list.
    unsigned char* curBuffer = m_allocation;

    for (unsigned int i = 0; i < count; ++i)
    {
        BufferNode* node = reinterpret_cast<BufferNode*>(curBuffer);
        node->next = m_freeBuffers;
        m_freeBuffers = node;

        curBuffer += size;
    }
}

MemoryPool::~MemoryPool()
{
    delete[] m_allocation;
}

inline unsigned char* MemoryPool::allocate()
{
    assert(m_freeBuffers != NULL);

    // Take a buffer from the front of the list.
    unsigned char* buf = reinterpret_cast<unsigned char*>(m_freeBuffers);
    m_freeBuffers = m_freeBuffers->next;
    return buf;
}

inline bool MemoryPool::free(unsigned char* buf)
{
    assert(buf != NULL);

    if (buf >= m_allocation && buf < m_allocationEnd)
    {
        // Add the buffer to the front of the list.
        BufferNode* node = reinterpret_cast<BufferNode*>(buf);
        node->next = m_freeBuffers;
        m_freeBuffers = node;
        return true;
    }
    return false;
}
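For reference, a minimal usage sketch of this pool (the sizes here are arbitrary example values, not from the real application):

#include <cassert>

int main()
{
    MemoryPool pool(1024, 32);              // 32 buffers of 1024 bytes each

    unsigned char* buf = pool.allocate();   // take one buffer from the free list
    // ... use up to 1024 bytes of buf ...
    bool ok = pool.free(buf);               // return it to the front of the list
    assert(ok);
    return 0;
}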


Quote:Original post by kauna

I am not sure if the issue was solved already, but I'd check whether there is a memory alignment issue here.

As far as I know, new char[...] doesn't necessarily align in the best way.

Cheers!


I already tried _aligned_malloc() with 16-, 32- and 64-byte alignment and there was no change in performance. Also, the allocator proposed by Antheus performed equally well with new as with _aligned_malloc().

I finally found out why timeBeginPeriod(1) improved performance with the VC compiler. It was caused by VS2008's Whole Program Optimization. If I disabled it, timeBeginPeriod(1) didn't have any effect on the execution time. Now, without whole program optimization, execution time for the one test case was always ~600 ms with any timer. I guess whole program optimization didn't do very well in this situation.
Quote:Original post by Ashkan
timeBeginPeriod() *globally* sets the system quantum so it also affects all other applications running at the moment. This is so abhorrent I wonder why it's allowed!
It's allowed because sometimes you just need either more accurate timing or more accurate sleeping resolution than "something around 16ms, maybe more, who knows".
Also, timeGetTime is the only reliable timer function available, but its default resolution is unusable. timeBeginPeriod fixes that problem.

While it is true that timeBeginPeriod affects all processes running on the system, this is not as horrendous as you depict it. The scheduler will use the minimum period requested by any process, so the worst that can happen is that a few threads in other programs wake up closer to what they requested than they otherwise would. A thread with a 20 ms timeout on WaitForSingleObject or Sleep might now wake up after 20 or 21 milliseconds, instead of after 32 milliseconds.
That means a few more context switches, so what. Note that changing the timer resolution to 1/16 of its standard value does not automatically mean 16 times as many context switches (though in the worst case, if every thread were doing Sleep(1) all the time, it could).

Nevertheless, the overhead from those context switches is ridiculously small compared to the crap that every new Windows version has running in the background, which nobody notices at all. It is also ridiculously small compared to the overhead generated by every mainstream virus scanner, which most people don't notice either.

Other programs calling timeGetTime will still get valid time values after you call timeBeginPeriod, too. They will in principle be the exact same values, only more accurate than usual (for example, incrementing in steps of 1 instead of being rounded down to the nearest 16), so there is really no harm.
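For what it's worth, the usual pattern is to raise the resolution only around the code that needs it and restore it afterwards. A minimal Windows-only sketch; the workload function is just a placeholder:

#include <cstdio>
#include <windows.h>
#pragma comment(lib, "winmm.lib")   // timeGetTime / timeBeginPeriod live in winmm

void run_benchmark();               // placeholder for whatever is being timed

void timed_run()
{
    timeBeginPeriod(1);                      // request 1 ms timer resolution
    DWORD start = timeGetTime();             // now ticks in ~1 ms steps, not ~16 ms
    run_benchmark();
    DWORD elapsed = timeGetTime() - start;
    timeEndPeriod(1);                        // always undo the resolution change
    std::printf("%lu ms\n", elapsed);
}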
Quote:Original post by Antheus

Potential performance benefit of pooling is secondary. Memory pools are used for different purposes:
- guaranteeing some minimum amount of available memory (embedded or otherwise resource-constrained environments)
- reducing or eliminating fragmentation (especially if there is no virtual memory)


These were the main reasons why I decided to use a memory pool. The performance improvement would be just a plus. Nevertheless, I still don't want worse performance than with new/delete.
Quote:Original post by Joni-Matti

So instead of 32 kB of memory usage, the pool consumes ~1MB. This code will eventually be run in embedded devices with limited memory resources, so wasting so much memory is not acceptable.

Ugh...

size_t byte_size = sizeof(T);
size_t page_size = cache_line_size;
if (byte_size > cache_line_size) {
    page_size = ((byte_size - 1) / cache_line_size + 1) * cache_line_size;
}
pool_size = page_size * count;
This rounds each element up to the nearest multiple of the cache line size, so the maximum overhead is (cache_line_size - 1) bytes per element. As said, this is intended for large blocks; small block allocators wouldn't do this. If the size of the element is a power of 2, there is no overhead.
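To sanity-check the rounding against the numbers from earlier (1024-byte elements, 64-byte cache lines, 32 elements), a small standalone version of just the arithmetic, not the allocator itself:

#include <cassert>
#include <cstddef>

// Round an element size up to a whole number of cache lines.
size_t page_size_for(size_t byte_size, size_t cache_line_size)
{
    if (byte_size <= cache_line_size)
        return cache_line_size;
    return ((byte_size - 1) / cache_line_size + 1) * cache_line_size;
}

int main()
{
    assert(page_size_for(1024, 64) == 1024);        // already a multiple of 64: no padding
    assert(page_size_for(1024, 64) * 32 == 32768);  // the pool is exactly 32 kB again
    assert(page_size_for(100, 64) == 128);          // worst case wastes < 64 bytes/element
    return 0;
}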

Quote:Why should a stack be preferred over a linked list?

See here (ppt) on Hot/Cold splitting.

The allocator operates on a considerably smaller chunk of memory, just the stack (for n = 32 that is a total of 128 or 256 bytes, or 2-4 cache lines). This increases the chance of operating completely in L1. free() is also very unlikely to cause a cache miss in case the buffer being freed has gone out of cache by the time it's called. When calling allocate(), if the application doesn't immediately write to the block, this version avoids a mandatory write to a block that is very likely not in cache.
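A rough sketch of the shape of that design (not the snipped code above): the free list is a small array of chunk pointers used as a LIFO stack, kept apart from the chunk memory itself:

#include <cassert>
#include <cstddef>

// Sketch of a stack-based pool in the spirit of the design described above.
// The free list lives in its own small array, so allocate()/free() only touch
// that array and never write into the chunk memory.
class StackPool
{
public:
    StackPool(unsigned char* memory, size_t chunk_size, size_t count)
        : m_stack(new unsigned char*[count]), m_top(count)
    {
        for (size_t i = 0; i < count; ++i)
            m_stack[i] = memory + i * chunk_size;
    }
    ~StackPool() { delete[] m_stack; }

    unsigned char* allocate()
    {
        assert(m_top > 0);
        return m_stack[--m_top];   // pop: one decrement and one load
    }
    void free(unsigned char* chunk)
    {
        m_stack[m_top++] = chunk;  // push: the chunk memory itself is not touched
    }

private:
    unsigned char** m_stack; // array of free chunk pointers (the "hot" data)
    size_t m_top;            // number of free chunks currently on the stack
};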

Quote:I already tried _aligned_malloc() with 16, 32 and 64 alignments and there was no change in performance.
I just put that in there to be sure. Usually malloc will align already, but some implementations can be a bit silly. It's not an optimization as such, more of a guarantee.
Quote:Original post by Joni-Matti
Nevertheless, I still don't want worse performance than with new/delete.
Ok, but the stack allocator posted by Antheus boils down to approximately one pointer decrement and one move instruction when full compiler optimizations are used, so this is at least several dozen times faster than operator new (and possibly a hundred times faster in the worst case). There is no reason for caching issues to be more prevalent than with the standard allocator, rather the opposite. So if your timings show something different, one can only conclude that your timings are wrong.

Quote:Original post by Antheus
I just put that in there to be sure. Usually malloc will align already, but some implementations can be a bit silly. It's not an optimization as such, more of a guarantee.
And rightly so, because the guarantee given by malloc is only alignment to the largest POD type, which is 8 bytes. Maybe MSVC and a few other compilers align on 16-byte boundaries by default, but gcc certainly does not.
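A quick throwaway check to see what alignment you actually get (MSVC-specific because of _aligned_malloc; purely a diagnostic, not part of any pool):

#include <cstdio>
#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)

int main()
{
    unsigned char* a = new unsigned char[1024];
    unsigned char* b = static_cast<unsigned char*>(_aligned_malloc(1024, 64));

    // Offset within a 64-byte cache line; 0 means cache-line aligned.
    std::printf("new[]           offset: %u\n", unsigned(reinterpret_cast<size_t>(a) % 64));
    std::printf("_aligned_malloc offset: %u\n", unsigned(reinterpret_cast<size_t>(b) % 64));

    _aligned_free(b);
    delete[] a;
    return 0;
}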
Quote:Original post by Joni-Matti
So instead of 32 kB of memory usage, the pool consumes ~1MB. This code will eventually be run in embedded devices with limited memory resources, so wasting so much memory is not acceptable.


Yeah and this type of performance tuning on a system that is entirely unlike the intended device is close to pointless.
I fiddled with this issue about a year ago, and the gain was around 20x over standard new/delete in speed (and 5x smaller memory consumption with std::list<void*>) with single-threaded profiling. It wasn't for generic use and was meant for use with multithreading. Though I didn't finish it, I'll share some experience...

First you have the data, which you will want to divide into fixed-size pages. I had aligned pages of 64k elements with element sizes of 4, 16 and 64 bytes.

For each page you need to track which elements are used or free. I call this the free index list. You want to keep the index list separate from the data it points to, so you don't have to pull the data into cache just to find a free slot.

The control data and the actual free index list contents were separated for better cache usage. By control data I mean a fast way to tell the free count of a free index list (i.e. of a block of memory). I had a collection of these index lists and a collection of their control data, so when you iterate the collection, multiple lists fit in one cache line.

You want to divide these pages into collections of full, empty and partially used pages. When you allocate and search for free space, you can skip the full pages. When you deallocate, you can skip the empty pages. Also, don't re-sort the pages on every deallocation, since most probably you're in a std::list destructor and deallocating a lot in a row.

The data pages with different element sizes were there to reduce the overhead of the free index lists. For larger allocations you use larger pages.
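Roughly, the bookkeeping looked like this (illustrative names only, not my actual code; the plain 32-bit mask stands in for the specialized index lists shown below):

#include <vector>
#include <cstddef>

// Sketch of the page bookkeeping described above. Each page holds a fixed
// number of fixed-size elements; the free-index mask and the free count live
// apart from the element data itself.
struct PageControl
{
    unsigned free_count;   // "control data": quick test for full/empty pages
    unsigned free_mask;    // 1 bit per element; stand-in for the 32-element index list
};

struct Page
{
    unsigned char* data;   // pointer to the page's element storage
};

struct SizeClass
{
    std::size_t element_size;
    std::vector<PageControl> control; // iterated while allocating: several fit per cache line
    std::vector<Page> pages;          // touched only once a free slot has been chosen
    std::vector<std::size_t> partial; // indices of partially used pages
    std::vector<std::size_t> full;    // skipped when allocating
    std::vector<std::size_t> empty;   // skipped when deallocating
};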

The free index lists are implemented by specialized, hand-optimized lists of varying sizes, all using the following code as the base. This is a 32-element singly linked list which can only hold the integers 0-31. It is branchless and always ordered, and it fits in a 32-bit register so you can iterate it without any memory operations. The headers it includes are not shown...

#include "popcount.h"
#include <cassert>

#define PT_INLINE __forceinline
#define PT_ASSUME(e) (__assume(e))

typedef unsigned char  uint8;
typedef unsigned short uint16;
typedef unsigned int   uint32;
typedef signed int     int32;

/// List of integers, fixed capacity of 32 elements.
///
/// Consumes 4 bytes of memory, each node is one bit. Whole list fits in a 32
/// bit register to enable compiler to apply heavy optimizations.
///
/// All member functions are branchless, no if's or loops.
/// Inlines to few machine instructions. ie constructor is single OR or XOR
/// instruction. There isn't any, or very few immediate constant loads.
///
class fixed_int_list_32
{
public:
    typedef uint32 value_type;
    enum filled_type { filled };
    class iterator;

    /// Constructs new empty list.
    fixed_int_list_32();
    /// Constructs new list filled with indices 0 to 31.
    fixed_int_list_32(filled_type);

    bool empty() const;
    bool full() const;
    uint8 size() const;

    iterator begin();
    iterator end();
    iterator iterator_at(uint32 index);

    void insert(uint32 index);
    void insert_2(value_type start_index);
    void insert_4(value_type start_index);

    /// Returns true if next 2 elements from index are available.
    bool next_2_available(value_type index) const;
    /// Returns true if next 4 elements from index are available.
    bool next_4_available(value_type index) const;

    /// Returns iterator which is greater than or equal to \a index.
    iterator find_ge(uint32 index) const;

    iterator erase(uint32 index);
    iterator erase(iterator it);
    void pop_front();
    void pop_back();

private: // data members
    uint32 m_data;
};

class fixed_int_list_32::iterator
{
public:
    typedef uint32 value_type;

    iterator(uint32 index, uint32 const& data)
    : m_data(&data)
    , m_index(index)
    {
    }

    value_type operator*() const
    {
        return static_cast<value_type>(m_index);
    }

    iterator& operator++()
    {
        // Index of 32 is end iterator, and it will zero out the data at shift.
        uint32 const data = *m_data;
        // Shift out one past the current bit towards LSB.
        // For some reason >> 32u doesn't clear MSB, so mask out MSB.
        uint32 const shifted_data = (data >> (m_index + 1u))
            & ~uint32(0x80000000);
        // Count zeros to skip.
        uint32 const skip_count = trailing_zero_count(shifted_data);
        // m_index = (skip_count < 32) ? index + 1 + skip_count : 32;
        uint32 const next_index = m_index + 1u + skip_count;
        int32 const a = 31 - skip_count;
        // index = (a >= 0) ? next_index : 32;
        m_index = select(a, next_index, 32);
        return *this;
    }

    bool operator!=(iterator const& other) const
    {
        return m_index != other.m_index
            || m_data != other.m_data;
    }

    bool operator==(iterator const& other) const
    {
        return m_index == other.m_index
            && m_data == other.m_data;
    }

    bool is_end() const
    {
        return m_index == 32u;
    }

private: // data members
    uint32 const* __restrict m_data;
    uint32 m_index;
};

// .inl

inline fixed_int_list_32::fixed_int_list_32()
: m_data(0)
{
}

inline fixed_int_list_32::fixed_int_list_32(filled_type)
: m_data(~static_cast<uint32>(0))
{
}

inline bool fixed_int_list_32::empty() const
{
    return m_data == 0;
}

inline bool fixed_int_list_32::full() const
{
    return m_data == ~static_cast<uint32>(0);
}

inline uint8 fixed_int_list_32::size() const
{
    return popcount_wiki_3(m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::begin()
{
    uint32 const first_bit_index = trailing_zero_count(m_data);
    return iterator(first_bit_index, m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::end()
{
    return iterator(32, m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::iterator_at(uint32 index)
{
    return iterator(index, m_data);
}

inline void fixed_int_list_32::insert(uint32 index)
{
    PT_ASSUME(index < 32);
    m_data |= 1 << index; // set bit
}

inline void fixed_int_list_32::insert_2(value_type start_index)
{
    PT_ASSUME(start_index < 31);
    m_data |= 3 << start_index;
}

inline void fixed_int_list_32::insert_4(value_type start_index)
{
    PT_ASSUME(start_index < 29);
    m_data |= 0x0f << start_index;
}

inline bool fixed_int_list_32::next_2_available(value_type index) const
{
    // return true if two bits are set
    return ((m_data >> index) & 3u) == 3u;
}

inline bool fixed_int_list_32::next_4_available(value_type index) const
{
    // return true if four bits are set
    return ((m_data >> index) & 15u) == 15u;
}

/// Returns iterator which is greater than or equal to \a index.
inline fixed_int_list_32::iterator fixed_int_list_32::find_ge(uint32 index) const
{
    assert(index <= 32);
    PT_ASSUME(index <= 32);
    // on an i386 "x shr/shl 32" = "x"
    // ie m_data >> 32 must have msb masked out.
    uint32 const mask = 0xffffffff >> (index - 1);
    uint32 const masked_bits = (m_data >> index) & mask;
    uint32 const first_bit_index = trailing_zero_count(masked_bits);
    int32 const next_index = index + first_bit_index;
    uint32 const clamped_next_index = branchless_min(next_index, int32(32));
    return iterator(clamped_next_index, m_data);
}

inline fixed_int_list_32::iterator fixed_int_list_32::erase(uint32 index)
{
    PT_ASSUME(index < 32);
    fixed_int_list_32::iterator const next = find_ge(index + 1);
    m_data &= ~(1 << index); // clear bit
    return next;
}

inline fixed_int_list_32::iterator fixed_int_list_32::erase(iterator it)
{
    uint32 const index = *it;
    ++it;
    m_data &= ~(1 << index); // clear bit
    return it;
}

inline void fixed_int_list_32::pop_front()
{
    uint32 const first_bit_index = trailing_zero_count(m_data);
    PT_ASSUME(first_bit_index < 32);
    m_data &= ~(1 << first_bit_index); // clear bit
}

inline void fixed_int_list_32::pop_back()
{
    uint32 const last_bit_index = 31 - leading_zero_count(m_data);
    PT_ASSUME(last_bit_index < 32);
    m_data &= ~(1 << last_bit_index); // clear bit
}


Given a 64-byte cache line, 16 of these lists fit in one line, i.e. 512 indices per cache line.
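To show the core trick without the helper headers, here is a self-contained toy version of the same idea: a 32-bit mask where a set bit means "slot free", and the first free slot is found with a trailing-zero count (a plain loop here instead of the branchless intrinsic):

#include <cassert>

// Toy stand-in for the intrinsic-based trailing_zero_count used above.
static unsigned trailing_zeros(unsigned x)
{
    if (x == 0) return 32;
    unsigned n = 0;
    while ((x & 1u) == 0) { x >>= 1; ++n; }
    return n;
}

int main()
{
    unsigned free_mask = ~0u;                  // all 32 slots free

    unsigned slot = trailing_zeros(free_mask); // first free slot (0)
    free_mask &= ~(1u << slot);                // mark it used
    assert(slot == 0);

    assert(trailing_zeros(free_mask) == 1);    // next free slot is now 1

    free_mask |= 1u << slot;                   // free slot 0 again
    assert(trailing_zeros(free_mask) == 0);
    return 0;
}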

Oh and then there is memory fragmentation. It's a complex issue, in which I feel I'm just a beginner.
Quote:Original post by Ftn
First you have the data, which you will want to divide into fixed-size pages. I had aligned pages of 64k elements with element sizes of 4, 16 and 64 bytes.

This is a valid and common design for a small block allocator.

But when working with large elements, the usage characteristics change.

It becomes closer to how SQL (or similar) databases are designed: the index is kept mostly in memory, while the data itself lives mostly on disk.

Quote:Yeah and this type of performance tuning on a system that is entirely unlike the intended device is close to pointless.
I wrote the pool allocation code incorrectly. There is no overhead if the element size is a multiple of the cache line size.

This topic is closed to new replies.
