Memory allocator. Maintaining free list.

6 comments, last by TheChubu 9 years, 10 months ago

Hi! I have some issues trying to implement a (sort of) memory allocator; it has to be tailored to the peculiarities of Java.

First and foremost, I'll describe what I have in Java instead of malloc() and friends:

Direct ByteBuffers: they're objects representing a chunk of memory outside the JVM's heap, so it can't be moved by the GC. They're used for interacting with native libraries (such as OpenGL or Bullet). ByteBuffers expose methods for putting things into the buffer, they have a capacity, and with some magical casting you can free them manually.

ByteBuffers, unlike Java arrays, can be sliced via the "byteBuffer.slice()" method. You get a ByteBuffer slice starting at the current position mark of the ByteBuffer, with a capacity equal to the remaining bytes in the backing buffer.
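A minimal sketch of the slicing behavior described above (the class and method names are just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceDemo {
    /** Slice off the tail of a buffer, starting at 'offset'. */
    static ByteBuffer sliceFrom(ByteBuffer backing, int offset) {
        backing.position(offset);   // slice() starts at the current position mark
        return backing.slice();     // capacity = remaining bytes in the backing buffer
    }

    public static void main(String[] args) {
        ByteBuffer backing = ByteBuffer.allocateDirect(1024).order(ByteOrder.nativeOrder());
        ByteBuffer slice = sliceFrom(backing, 32);
        System.out.println(slice.capacity()); // 992 (= 1024 - 32)
        System.out.println(slice.isDirect()); // true: slices of direct buffers are direct
    }
}
```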

So what I wanted to do is the following:

Have a BufferAllocator that allocates a big backing buffer (a couple of MBs) and manages slices of it, so as to avoid allocating/deallocating small buffers all over the codebase. The "capacity" field of a slice can't be modified normally, so I set it via reflection, so the user of the returned slice can't step on the memory that lies beyond their slice.

The ByteBuffer doesn't "know" how much of it is being used. It only has a position mark, a limit mark, and a capacity; you write data into it however you want. So to manage the backing buffer I need an additional structure that can represent which parts of it are free and which parts are in use.

I decided to try a simple ArrayList of int[] arrays of size 2 representing the "free block list". The first element of each int[] is the start offset of the block; the second element is the end of the block.

Say I ask for a 32-byte buffer, and the allocator has 1 KB capacity:

The allocator starts with a free block list containing a single element: an int[] with 0 as the start value and 1024 as the end value.

The allocator finds the smallest free block that can hold the 32 byte buffer it was asked for (ie, the only free block we have).

The allocator makes a slice starting at the free block's start and ending at the 32 byte mark.

There is still a single contiguous free block, but now it starts at the 32 byte mark, and ends at the 1024 byte mark.

A ByteBuffer slice with capacity 32 is returned.

Have in mind this free block list is just a list of values to do the free block bookkeeping, not the actual memory.
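The walkthrough above can be expressed as pure bookkeeping over {start, end} pairs, with a best-fit search and the chosen block shrunk from the left. A sketch under those assumptions (FreeList and allocate are hypothetical names; no real memory is touched):

```java
import java.util.ArrayList;
import java.util.List;

/** Best-fit bookkeeping over {start, end} free blocks; not the actual memory. */
public class FreeList {
    final List<int[]> free = new ArrayList<>();

    FreeList(int capacity) {
        free.add(new int[] { 0, capacity }); // one big free block at the start
    }

    /** Returns the start offset of the reserved range, or -1 if nothing fits. */
    int allocate(int size) {
        int best = -1;
        for (int i = 0; i < free.size(); i++) {
            int len = free.get(i)[1] - free.get(i)[0];
            // Best fit: the smallest free block that can hold the request.
            if (len >= size && (best < 0 || len < free.get(best)[1] - free.get(best)[0]))
                best = i;
        }
        if (best < 0) return -1;
        int[] b = free.get(best);
        int start = b[0];
        b[0] += size;                         // shrink the free block from the left
        if (b[0] == b[1]) free.remove(best);  // block fully consumed
        return start;
    }
}
```

In the 1 KB example, the first allocate(32) returns offset 0 and leaves a single free block {32, 1024}, matching the steps above.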

Now, to "free" a ByteBuffer slice, I'd need to find roughly its position inside the free block list.

If there is a neighbor free block (ie, the freed block isn't between two used blocks), I'd need to modify its start/end offsets as required.

Otherwise, create a new int[] instance that holds the freed block's start and end offsets, then insert it into the free block list.

The problem is that if the block I'm trying to free is between two free blocks, I'd need to merge all 3 (left free block, the block I'm freeing, and the right free block). That would mean, removing one of the adjacent free blocks, and expanding the remaining one to cover the block I removed and the block I'm freeing.

For both of these: I'm not sure how to do it without a chain of 10 if/else statements (hell, I even tried with lots of if/else statements and I still can't get it to work). And I'm obviously not doing the most efficient lookup for finding free blocks to slice.

Here is my (broken) implementation so far:

http://paste.ofcode.org/3rJEwJtLx2L96KVBX8WX5e

Any help?

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator


Now, to "free" a ByteBuffer slice, I'd need to find roughly its position inside the free block list.

You mean in the used block list. How do you plan to make things work roughly? I suspect I'm not getting the picture. I use a dual-list for my allocator.

The problem is that if the block I'm trying to free is between two free blocks, I'd need to merge all 3 (left free block, the block I'm freeing, and the right free block). That would mean, removing one of the adjacent free blocks, and expanding the remaining one to cover the block I removed and the block I'm freeing.

In my experience, "forward merging" appeared to be sufficient. Keep in mind free memory is unreferenced, "useless" memory, so you can really do this every, say, 1024 frees, or every time an alloc fails. Making alloc performance nondeterministic doesn't sound like much of a big problem to me.
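The deferred approach described above can be a single pass over the start-sorted free list, run only when an allocation fails or every N frees; a sketch (MergePass is a hypothetical name, same int[]{start, end} blocks as in the original post):

```java
import java.util.List;

public class MergePass {
    /** One deferred coalescing pass: fold each block into its predecessor when adjacent. */
    static void run(List<int[]> free) {
        // Iterate backwards so removals don't disturb unvisited indices.
        for (int i = free.size() - 1; i > 0; i--) {
            int[] prev = free.get(i - 1), cur = free.get(i);
            if (prev[1] == cur[0]) {   // prev ends exactly where cur starts
                prev[1] = cur[1];
                free.remove(i);
            }
        }
    }
}
```

This keeps the per-free path trivial (just insert into the sorted list) at the cost of an occasional O(n) pass.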

Previously "Krohm"

FWIW:

for allocating small objects (generally under 1-4kB), I prefer to use bitmaps over fixed-size cells instead of linked-lists.

pros (vs a linked-list allocator):

they tend to have a smaller per-object overhead (no internal pointers needed, but a constant per-space overhead for the bitmaps);

they will implicitly merge adjacent free memory once memory is freed;

object lookup from arbitrary pointers can be made very fast (good for things like dynamic type-checking and similar, which can be made O(1));

basic logic is fairly simple;

easier to add logic to detect/analyze overruns (they will corrupt object data, but will not necessarily cause critical damage to heap metadata, which may be stored elsewhere in memory).

cons:

direct allocation/freeing becomes linearly slower depending on object sizes (need to scan/update bitmaps);

high overheads for larger objects (due to using lots of bitmap entries / cells);

more limited "effective range" of object sizes (vs being similarly effective over a wide range of sizes, as-is the case for lists).

partial solutions:

use a pool allocator to speed up allocation/freeing (the pool allocator then serves as the primary allocator, and uses the bitmap allocator for backing, where the pool-allocator has free-lists of fixed-size items);

use several different strategies, picking the one most appropriate for the given object size.

for example, in my current main allocator, it works sort of like this (16 byte cells):

* under 16 bytes: use a slab-based allocator (say, allocate memory for items in groups of 256 and pad them to a power-of-2 size);

* 16-6143 bytes: use bitmap allocator (with a pool allocator to make it faster);

* 6kB+: use a "large object" heap, which manages memory objects mostly via sorted arrays of pointers and object headers (mostly using binary lookup and binary insert).

I had at one point considered adding an additional layer of 256-byte cells which would be used for 4kB to 256kB, but had not done so.
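As a rough illustration of the bitmap-over-fixed-cells idea (a boolean[] stands in for the packed bitmap; a real implementation would use long[] words, and the names here are made up):

```java
public class BitmapAlloc {
    final boolean[] used;   // one flag per fixed-size cell
    final int cellSize;

    BitmapAlloc(int cells, int cellSize) {
        this.used = new boolean[cells];
        this.cellSize = cellSize;
    }

    /** Linear scan for a run of free cells; returns a byte offset, or -1. */
    int allocate(int bytes) {
        int need = (bytes + cellSize - 1) / cellSize; // round up to whole cells
        int run = 0;
        for (int i = 0; i < used.length; i++) {
            run = used[i] ? 0 : run + 1;
            if (run == need) {
                int first = i - need + 1;
                for (int j = first; j <= i; j++) used[j] = true;
                return first * cellSize;
            }
        }
        return -1;
    }

    /** Clearing the bits is the whole free path: adjacent frees merge implicitly. */
    void free(int offset, int bytes) {
        int first = offset / cellSize;
        int need = (bytes + cellSize - 1) / cellSize;
        for (int j = first; j < first + need; j++) used[j] = false;
    }
}
```

This shows the two properties claimed above: no per-object link pointers, and no explicit merge step on free.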

though, yes, a linked list works. a few ideas:

pad object sizes up to multiples of a certain value (ex: 8 or 16 bytes), which will help reduce fragmentation;

use free-lists of fixed allocation sizes (avoids the need for searching), or size-ranges (reduces searching);

only merge with next item on free, and/or have a special case to walk the heap and merge adjacent heap items as-needed.


You mean in the used block list. How do you plan to make things work roughly? I suspect I'm not getting the picture. I use a dual-list for my allocator.
It's in the code. I do a binary search on the free block list (yes, the free block list), which is ordered by the starting position (ie, offset into the backing buffer) of the free blocks, to find where the used block would be placed if it were a free block. From there I can determine what's the next free block and the previous free block, if there is any.

The free block list only contains structs (actually int[]s, but implementation details...) that hold the start position and the end position of each free block. I don't keep track of the used blocks, just the free ones.


In my experience, "forward merging" appeared to be sufficient. Keep in mind free memory is unreferenced, "useless" memory, so you can really do this every, say, 1024 frees, or every time an alloc fails. Making alloc performance nondeterministic doesn't sound like much of a big problem to me.

You mean, if a free block wasn't found, do a merging pass on the free block list, then try again to find a new block?

@BGB thanks for the ideas! I'm looking for the specific implementation of free block merging in a free block list, though. I know there are other allocators; I just wanted to use the most straightforward one that allows me to make arbitrary allocations (and whose inner workings are more or less translatable to Java).

Slicer provided me with an implementation of the kind of allocator I'm making, I'll try to code that one in Java and see if I get it working.

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

Something else that may help is if you trade space for speed: allocate pre-sized blocks of ints, say 32 or 64, and keep all blocks the same size. If objects request a number of blocks of memory and handle them internally, it becomes a simple problem. It would take more space of course, because there'd be unused numbers, but it's a thought.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

Something else that may help is if you trade space for speed: allocate pre-sized blocks of ints, say 32 or 64, and keep all blocks the same size. If objects request a number of blocks of memory and handle them internally, it becomes a simple problem. It would take more space of course, because there'd be unused numbers, but it's a thought.

(if I understand correctly) this is basically what I had meant by a slab allocator.

a slab allocator can be fast, but also reduce memory use by allowing multiple objects to share the same object-header (all of them will implicitly have the same logical-type and share other details).

for 4 byte items, you would allocate space for 256 items (1KiB), and then add them to a free-list of 4 byte items.

for 8 or 16 byte items, it would be similar, just allocating 2KiB or 4KiB instead.

then, for a range of items, you would use a number of free-lists of fixed-size members, say, using something like:

nb=(sz+15)>>4;

to map the size to a specific free-list.

while this isn't perfect, allocations/frees can generally be made pretty fast (in the case where the free-lists aren't empty).

allocation in this case basically looks like:

if(mm_freelist[nb])
{
    ptr=mm_freelist[nb];
    mm_freelist[nb]=*(void **)ptr;  /* each free item holds a pointer to the next */
    return(ptr);
}
... fall back to general-case allocation ...

faster is possible via a few tricks, namely having functions hard-coded for a specific allocation/free size, and using separate per-thread contexts (to avoid a need for locking during free-list allocations). for a general-case allocator, it is usually necessary to use a lock to prevent threads from stomping on each other (and for operations which work with or update the heap).

another trick here is to use pre-made "allocation handles", where one first creates a handle-object describing the type of allocation (size, type, flags, ...), and then does allocations/frees using this handle. internally, the handle may contain function-pointers for the specific allocation/free logic, and some data members that may be used for special free-lists or other things.

these may or may not use slabs, but will typically use a pool-allocation strategy (where dedicated free-lists exist for specific types of object, ...).

this may not necessarily apply to "general" code, but in my case code sometimes pops up which requires high-speed allocation/freeing of typically-small objects (often with specific type-information and special semantics/behaviors).

likewise, these large numbers of small objects comprise a big part of the heap (there are often millions of them in dumped memory-stats, 16-64 byte objects comprising the vast majority of these), vs the relatively small number of significantly-larger allocations (thousands of kB range objects, and tens of MB-range objects).

keeping the object-header small is generally also a goal.

currently, my MM/GC uses a 16-byte header, which basically holds:

the object type-ID (objects in my main MM each have a qualified type, externally represented as a string giving a type-name);

the object allocation size;

data about where the allocation-call was made (useful for debugging);

reference-counts (for ref-counted data);

some flags and other things;

a check-value (used to help detect if the header was corrupted, say by an overrun, *).

a lot of this is basically bit-packed, with a few special header sub-variants (such as for objects currently in the free-list, ...).
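As an illustration of this kind of bit-packing (the field widths below are made up for the example, not BGB's actual layout): a 64-bit word can hold a 20-bit type ID, a 28-bit size, 4 bits of flags, and a 12-bit check value.

```java
/** Hypothetical bit-packed header word: [typeId:20][size:28][flags:4][check:12]. */
final class ObjHeader {
    static long pack(int typeId, int size, int flags, int check) {
        return ((long) typeId << 44)
             | ((long) (size & 0xFFFFFFF) << 16)
             | ((flags & 0xF) << 12)
             | (check & 0xFFF);
    }
    static int typeId(long w) { return (int) (w >>> 44); }
    static int size(long w)   { return (int) ((w >>> 16) & 0xFFFFFFF); }
    static int flags(long w)  { return (int) ((w >>> 12) & 0xF); }
    static int check(long w)  { return (int) (w & 0xFFF); }
}
```

The 12-bit check field matches the 1/4096 false-negative odds mentioned below: a random corruption passes the check with probability 1/4096 unless extra sanity checks are layered on top.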

it was originally an 8-byte header, but was expanded to 16 partly as:

I wanted to support larger objects (the same header is used by large-objects as well, and would otherwise limit allocations in 64-bit apps to 256MB);

needed space for additional metadata;

many objects require 16-byte alignment anyways (for SSE/etc);

...

*: detecting corrupt headers is useful for things like heap forensics, where the MM may detect that a header is corrupt, dump diagnostic information, try to work out which memory object overflowed or underflowed, and make a guess as to where in the codebase to start looking for the offending code, in addition to raising an exception. Also, when we free an object, it helps the MM detect whether the object overran its bounds, without requiring additional memory for extra check-values.

though, say, using 12-bits for a check-value, there is a 1/4096 chance for trashed-headers to escape notice, but this can be reduced some by use of additional "sanity checks" (example: does the size in the header fall within the range of what could fit within the given memory-cells and is appropriate for this part of the heap; is the object type-ID and source-ID valid?), but this needs to be balanced with performance (leaving it more as something for heap-analysis).

These things always get more complicated when you start trying to save/load game state in files for later use. I think the header idea is great.

Whenever I find myself with this kind of problem, I back up and ask "What problem am I trying to solve." Sometimes I get so carried away with the solution I forget to monitor whether or not I'm creating an "optimized" solution without a problem.


Hi! I have some issues trying to implement a (sort of) memory allocator; it has to be tailored to the peculiarities of Java.

What are you trying to do, write really cool software or something? Sheesh.



What are you trying to do, write really cool software or something? Sheesh.
:P IIRC, you said you use LWJGL, right? Well, for functions that need an array of things or simply blocks of data, you have to use direct (off-JVM-heap) ByteBuffers, since you can't pass Java heap memory to a native library (the GC will move it around, and there is more stuff in it, like the JVM's bookkeeping). I wanted to make a single ByteBuffer allocator instead of having direct ByteBuffers declared everywhere I needed them (mesh loaders, texture loaders, UBO uploader, renderer, etc); I'd only need to pass around an allocator with a single backing direct buffer (ie, allocate a 16 MB buffer and use that one everywhere in the program).

In any case, thank you all for your suggestions! Slicer helped me with the free block merging code, here is the allocator:

http://paste.ofcode.org/UZhPFBn23XxVQx3wsNqx79

And for anyone trying to use ByteBuffer slices to pass around to native libraries, a gotcha: ByteBuffer.slice() returns a ByteBuffer with BIG_ENDIAN ordering, no matter if the "parent" buffer is LITTLE_ENDIAN. So you've got to set the slice's ordering to whatever the native ordering is on your platform (you can see it in line 176).
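The gotcha is easy to demonstrate (class name is just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceOrder {
    /** slice() always resets the byte order to BIG_ENDIAN; restore the native order. */
    static ByteBuffer nativeSlice(ByteBuffer parent) {
        return parent.slice().order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        ByteBuffer parent = ByteBuffer.allocateDirect(64).order(ByteOrder.nativeOrder());
        System.out.println(parent.slice().order());      // BIG_ENDIAN, regardless of parent
        System.out.println(nativeSlice(parent).order()); // the platform's native order
    }
}
```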

BTW, have in mind I use a little reflection to set the 'capacity' field of the slices so they don't overlap contiguously reserved blocks. It is slow, yes, but there is no other way to do it; all DirectByteBuffer constructors are package-private. Hopefully, by isolating it in its own "setCapacity" static final method, HotSpot can optimize it as much as possible.

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

This topic is closed to new replies.
